Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on
October 22, 2002. The Notice of Appeal was timely submitted on February 21, 2003, and was received
in the Patent and Trademark Office ("the Office") on February 28, 2003. This Appeal Brief is timely
submitted in light of the concurrently filed Petition for an Extension of Time of three months to and including
July 28, 2003, and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(3) from 
Appellants' Representatives' deposit account. The Commissioner is also authorized to charge the fee for 
filing this Appeal Brief ($160.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics 
Incorporated Deposit Account No. 50-0892. 

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the 
extension of time are due in connection with this Appeal Brief. However, should any additional fees under 
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner 
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated
Deposit Account No. 50-0892. 

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest 
Place, The Woodlands, Texas, 77381. 

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences. 



III. STATUS OF THE CLAIMS 

The present application was filed on January 26, 2001, claiming the benefit of U.S. Provisional
Application Number 60/178,557, which was filed on January 26, 2000, and U.S. Provisional Application
Number 60/199,513, which was filed on April 25, 2000, and included original claims 1-5. A Restriction
and Election Requirement was set forth during a telephone interview between the Examiner and Appellants' 
representative David Hibler on February 8, 2002, separating the original claims into three separate and 
distinct inventions. During this telephone conference, Appellants provisionally elected without traverse the 
claims of the Group I invention (original claims 1-3) for prosecution on the merits. 

A First Official Action on the merits ("the First Action") was issued on April 23, 2002, in which 
the title and abstract of the application, the Oath/Declaration, and claims 1 and 2 were objected to, claims 
1-3 were rejected under 35 U.S.C. § 101 as allegedly lacking a patentable utility, claims 1-3 were rejected 
under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack
of patentable utility, claim 1 was rejected under 35 U.S.C. § 112, first paragraph, as allegedly lacking
enablement for the full scope of the claimed invention, claim 1 was rejected under 35 U.S.C. § 112, first
paragraph, as allegedly not described in the specification in such a way as to reasonably convey to one
skilled in the relevant art that the inventor, at the time the application was filed, had possession of the
claimed invention, claim 2 was rejected under 35 U.S.C. § 112, second paragraph, as allegedly indefinite,
and claim 1 was rejected under 35 U.S.C. § 102(b) as allegedly anticipated by Hillier et al. (Accession
Number AA069426). In a response to the First Official Action submitted to the Office on 
September 20, 2002 ("response to the First Action"), Appellants submitted a supplemental declaration, 
amended the title of the application, cancelled claims 4 and 5 without prejudice and without disclaimer as 
drawn to non-elected inventions, amended claims 1 and 2, added new claims 6-8, and addressed the
rejections of claims 1-3.

A Second and Final Official Action ("the Final Action") was mailed on October 22, 2002, 
indicating that the objections to the title and abstract of the application, the Oath/Declaration, and 
claims 1 and 2, and the rejections of claim 1 under 35 U.S.C. § 112, first paragraph, as allegedly lacking
enablement for the full scope of the claimed invention, claim 1 under 35 U.S.C. § 112, first paragraph, as
allegedly not described in the specification in such a way as to reasonably convey to one skilled in the
relevant art that the inventor, at the time the application was filed, had possession of the claimed invention,
claim 2 under 35 U.S.C. § 112, second paragraph, as allegedly indefinite, and claim 1 under
35 U.S.C. § 102(b) as allegedly anticipated by Hillier et al (Accession Number AA069426), had been 
overcome by the amendments and remarks submitted in the response to the First Action, but maintaining 
the rejection of claims 1-3 (and newly added claims 6-8) under 35 U.S.C. § 101 as allegedly lacking a 
patentable utility and under 35 U.S.C. § 112, first paragraph as allegedly unusable by the skilled artisan
due to the alleged lack of patentable utility. In a response to the Second and Final Office Action submitted 
on January 22, 2003 ("response to the Final Action"), Appellants again addressed the rejections of 
claims 1-3 and 6-8. An Advisory Action ("the Advisory Action") was mailed on February 19, 2003, 
maintaining the rejection of claims 1-3 and 6-8 under 35 U.S.C. § 101 as allegedly lacking a patentable 
utility and under 35 U.S.C. § 112, first paragraph as allegedly unusable by the skilled artisan due to the
alleged lack of patentable utility. Therefore, claims 1-3 and 6-8 are the subject of this appeal. A copy of
the appealed claims is included below in the Appendix (Section IX).

IV. STATUS OF THE AMENDMENTS 

As no amendments subsequent to the Final Action have been filed, Appellants believe that no 
outstanding amendments exist. 

V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human 
polynucleotide sequences that encode proteins sharing sequence similarity with animal neurexin proteins, 
particularly contactin associated proteins (see, at least, the specification at page 1, lines 9-12 and page 15,
lines 18-21). 

The presently claimed polynucleotide sequences were compiled from clustered human gene trapped 
sequences, ESTs, and cDNAs from human brain, fetal brain, cerebellum and hypothalamus cDNA libraries 
(specification at page 15, lines 15-17). 

The specification details a number of uses for the presently claimed polynucleotide sequences, 
including assessing gene expression patterns, particularly in diagnostic assays such as forensic analysis (see,
for example, the specification at page 10, lines 15-19), in expression profiling using a high throughput "chip"
format (specification at page 5, lines 12-14), in determining the genomic structure, for example through the 
identification of coding sequence, and mapping the sequences to a specific region of a human chromosome 
(specification at page 10, line 18). 

VI. ISSUES ON APPEAL 

1. Do claims 1-3 and 6-8 lack a patentable utility? 

2. Are claims 1-3 and 6-8 unusable by a skilled artisan due to a lack of patentable utility?

VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first
paragraph concerning utility, the claims will stand or fall together. 

VIII. ARGUMENT 

A. Do Claims 1-3 and 6-8 Lack a Patentable Utility? 

The Final Action next rejects claims 1-3 and 6-8 under 35 U.S.C. § 101, as allegedly lacking a 
patentable utility due to not being supported by either a specific and substantial utility or a well-established 
utility. 

The Final Action admits that contactin associated proteins (casprs) have a "specific utility" (the 
Final Action at page 3), but that "Applicants have not provided any specific information regarding the 
specific utility of the proteins of the present invention which distinguishes them from other members of the 
neurexin superfamily" (the Final Action at page 3). Appellants respectfully point out that the presently
claimed sequence is clearly referred to in the specification as originally filed as a contactin associated
protein (see, at least, the specification at page 1, lines 9-12, and page 15, lines 18-21). Furthermore, as
set forth in the Response to the First Action and the Response to the Final Action, two sequences sharing
nearly 100% identity at the protein level over the entire length of the claimed sequence are present
in the leading scientific repository for biological sequence data (GenBank), and have been annotated by
third party scientists wholly unaffiliated with Appellants as "Homo sapiens caspr5 protein" (GenBank
accession numbers NM_130773 (alignment and GenBank report provided in Exhibit A) and AB077881
(alignment and GenBank report provided in Exhibit B)). The legal test for utility simply involves an 
assessment of whether those skilled in the art would find any of the utilities described for the invention to 
be credible or believable. Given these GenBank annotations, there can be no question that those skilled 
in the art would clearly believe that Appellants' sequence is a caspr protein. 

Furthermore, it is well-known in the art that caspr proteins are distinct members of the neurexin 
superfamily (see Poliak et al., Neuron 24:1037-1047, 1999, and Spiegel et al., Mol. Cell. Neurosci.
20:283-297, 2002; abstracts provided in Exhibit C). As a matter of law, it is well settled that a patent 
need not disclose what is well known in the art. In re Wands, 8 USPQ 2d 1400 (Fed. Cir. 1988). 
Additionally, as described in the Response to the Final Action, the previously described caspr proteins 
(caspr 2, 3 and 4) share between 42% and 63% homology with each other (caspr 2 (GenBank accession 
number AAF25199) vs. caspr 3 (GenBank accession number NP_387504), 42% identity (alignment
provided in Exhibit D), caspr 2 vs. caspr 4 (GenBank accession number NP_207837), 48% identity
(Exhibit E), and caspr 4 vs. caspr 3, 63% identity (alignment provided in Exhibit F)), but only 23% to
26% homology to neurexins (neurexin 1, 2 and 3); neurexin 1 (GenBank accession number NP_004792)
vs. caspr 2, 24% identity (alignment provided in Exhibit G), neurexin 1 vs. caspr 3, 23% identity (Exhibit
H), neurexin 1 vs. caspr 4, 25% identity (alignment provided in Exhibit I), neurexin 2 (GenBank accession
number NP_620060) vs. caspr 2, 24% identity (alignment provided in Exhibit J), neurexin 2 vs. caspr 3,
25% identity (alignment provided in Exhibit K), neurexin 2 vs. caspr 4, 26% identity (alignment provided
in Exhibit L), neurexin 3 (GenBank accession number CAC87720) vs. caspr 2, 26% identity (alignment
provided in Exhibit M), neurexin 3 vs. caspr 3, 25% identity (alignment provided in Exhibit N), neurexin
3 vs. caspr 4, 25% identity (alignment provided in Exhibit O). That Appellants' claimed sequence is a
caspr is further confirmed by the fact that Appellants' sequence shares between 48% and 59% homology
to the other caspr proteins (vs. caspr 2, 51% identity (alignment provided in Exhibit P), vs. caspr 3, 48%
identity (alignment provided in Exhibit Q), and vs. caspr 4, 59% identity (alignment provided in
Exhibit R)), but only 24% to 26% homology to the neurexin proteins (vs. neurexin 1, 25% identity
(alignment provided in Exhibit S), vs. neurexin 2, 24% identity (alignment provided in Exhibit T), and vs.
neurexin 3, 26% identity (alignment provided in Exhibit U)), perfectly in line with the previously established
figures. Given these data, there can be no question that those skilled in the art would clearly believe that 
Appellants' sequence is a caspr protein, as opposed to "other members of the neurexin superfamily". As 
the Examiner admits that casprs have a specific utility, due to their association with "myelinated axons and 
potassium channels" (the Final Action at page 3), the claimed sequence clearly meets the requirements of 
35 U.S.C. § 101.

Nevertheless, the Advisory Action continues to question Appellants' asserted utility, stating that
"Applicants have only alleged that the protein of the present invention is a caspr based on homology to
neurexins and casprs" (Advisory Action at page 2). Appellants respectfully point out that this is all that
is required for the claimed sequence to meet the requirements of 35 U.S.C. § 101. The present situation
directly tracks Example 10 of the Revised Interim Utility Guidelines Training Materials (pages 53-55;
Exhibit V), which clearly establishes that a rejection under 35 U.S.C. § 101 as allegedly lacking a
patentable utility and under 35 U.S.C. § 112, first paragraph as allegedly unusable by the skilled artisan
due to the alleged lack of patentable utility (see Section VIII(B), below), is not proper when a full length
sequence (such as the presently claimed sequence) has a similarity score greater than 95% to a protein
having a known function (such as the nearly 100% identity between the presently claimed sequence and 
the caspr 5 sequences, as discussed above). The Advisory Action concludes that "Applicants have not 
provided any definitive evidence or statement concluding that the protein of the invention is, in fact, a caspr 
protein as opposed to another member of the neurexin family of proteins" (Advisory Action at page 2). 
As discussed at length in the previous paragraph, this assertion is simply not true. Appellants have 
repeatedly provided evidence and stated for the record that the presently claimed sequence encodes a 
caspr protein. The Examiner seems to be focusing on Appellants' statements in the specification that the
presently claimed sequence "share sequence similarity with animal neurexin proteins and contactin 
associated proteins" (specification at page 1, lines 11-12) and share "similarity with a variety of proteins, 
including, but not limited to, neurexins (including secreted types) and contactin associated proteins" 
(specification at page 15, lines 19-21) as an admission that "Applicants have not provided any definitive
evidence or statement concluding that the protein of the invention is, in fact, a caspr protein as opposed to
another member of the neurexin family of proteins". However, the statements in the specification as 
originally filed are completely correct, in that as contactin associated proteins are well known to be 
members of the neurexin superfamily, the presently claimed sequence, as encoding a caspr protein, does 
in fact share similarity with neurexin proteins. Furthermore, by specifically singling out that the presently
claimed sequence shares similarity with contactin associated proteins, Appellants leave no doubt that the 
presently claimed sequence specifically is a caspr protein, as opposed to any of the other members of the 
neurexin superfamily. Thus, the Examiner's arguments in no way support the allegation that the presently 
claimed sequence lacks a patentable utility. 

Although not reiterated in the Final Action or in the Advisory Action, in the First Action the 
Examiner questioned Appellants' assertion that the presently claimed sequence encodes a caspr protein, 
citing a number of scientific articles to support this position. Although this issue has been overcome above, 
in an abundance of caution Appellants wish to take this opportunity to refute the points raised by the 
Examiner in the First Action. The First Action cited an article by Skolnick et al. ("Skolnick"; 2000, Trends
in Biotech. 18:34-39) for the proposition that "(k)nowing the protein structure by itself is insufficient to
annotate a number of functional classes and is also insufficient for annotating the specific details of protein 
function" (Skolnick at page 36, emphasis added). However, Skolnick concerns predicting protein function 
not by overall amino acid homology to other family members, but instead concerns prediction of function 
based on the presence of certain functional "motifs" present within a given protein sequence. Thus, 
Skolnick does not apply to the current situation, where overall protein homology is used to assign function 
to a particular sequence. However, even in the event that Skolnick is applicable, Skolnick itself concludes 
that "sequence-based approaches to protein-function prediction have proved to be very useful" (Skolnick 
at page 37), admitting that such methods have correctly assigned function in 50-70% of the cases, thus 
arguing against the conclusion drawn by the Examiner in the First Action. 

The Examiner next cited Bork (Genome Research 10:398-400, 2000) as supporting the
proposition that prediction of protein function from homology information is somewhat unpredictable.
However, nowhere in Bork is there a comparison of the prediction accuracy based on the percentage
homology between two proteins or two classes of proteins, and thus Bork does not support the alleged lack of
utility for the present invention. Additionally, Bork concludes that "there is still no doubt that sequence 
analysis is extremely powerful" (Bork at page 400), also arguing against the conclusion drawn by the 
Examiner in the First Action. 

The Examiner next cited Doerks et al. (Trends in Genetics 14:248-250, 1998) for the proposition
that sequence-to-function methods of assigning protein function are prone to errors. However,
Doerks et al. states that "utilization of family information and thus a more detailed characterization" should
lead to "simplification of update procedures for the entire families if functional information becomes
available for at least one member" (Doerks et al., page 248, paragraph bridging columns 1 and 2, emphasis
added). Appellants point out that, as detailed above, two sequences sharing nearly 100% identity
at the protein level with the claimed sequence are present in the leading scientific repository for biological
sequence data (GenBank), and have been annotated by third party scientists wholly unaffiliated with
Appellants as "Homo sapiens caspr5 protein" (GenBank accession numbers NM_130773 (Exhibit A)
and AB077881 (Exhibit B)). The caspr protein family is a well-studied protein family with known
functional information, exactly the situation that Doerks et al. suggests will "simplify" and "avoid the pitfalls"
of previous sequence-to-function methods of assigning protein function (Doerks et al., page 248,
columns 1 and 2). Thus, instead of supporting the Examiner's position against utility, Doerks et al. actually
supports Appellants' position that the presently claimed sequences have a substantial and credible utility. 

The Examiner next cited Smith et al. (Nature Biotechnology 15:1222-1223, 1997) as teaching
"that there are numerous cases in which proteins of very different functions are homologous" (the First 
Action at page 6). However, the Smith and Zhang article also states "the major problems associated with 
nearly all of the current automated annotation approaches are - paradoxically - minor database annotation 
inconsistencies (and a few outright errors)" (page 1222, second column, first paragraph, emphasis added). 
Thus, Smith and Zhang do not in fact seem to stand for the proposition that prediction of function based
on homology is fraught with uncertainty, and thus also do not support the alleged lack of utility. The
citation of Pilbeam et al. ("Pilbeam"; 1993, Bone 14:717-720), which allegedly details that "PTH and
PTHrP are two structurally closely related proteins which can have opposite effects on bone resorption"
(the First Action at page 6), is also hardly indicative of a high level of uncertainty in assigning function based 
on sequence. In fact, Pilbeam details that "the biological activities of hPTHrP 1-34 and synthetic bPTH 
1-34 have generally been shown to be qualitatively similar" (Pilbeam at page 717), and thus also does not 
support the alleged lack of utility. 

The Examiner next cited Brenner (TIG 15:132-133, 1999) as teaching that "most homologs must
have different molecular and cellular functions" (the First Action at page 6). However, this statement is 
based on the assumption that "if there are only 1000 superfamilies in nature, then most homologs must have 
different molecular and cellular functions" (Brenner, page 132, second column). Furthermore, Brenner 
suggests that one of the main problems in using homology to predict function is "an issue solvable by 
appropriate use of modern and accurate sequence comparison procedures" (Brenner, page 132, second 
column), and in fact references an article by Altschul et al., which is the basis for one of the "modern and
accurate sequence comparison procedures" used by Appellants. Thus, the Brenner article also does not 
support the alleged lack of utility. 

The Examiner finally cited Bork et al. (Trends in Genetics 12:425-427, 1996) as supporting the
proposition that prediction of protein function from homology information is somewhat unpredictable, based 
on the "structural similarity of a small domain of the new protein to a small domain of a known protein" (the 
First Action at page 3). Thus, the Examiner's reliance on Bork et al has the same failing as described 
above for Doerks et al., specifically, the assumption that Appellants' assertion that the present sequences
are caspr proteins is made on the basis of structural similarity of a small domain of the new protein to a
small domain of a known protein. Appellants again would like to invite the Board's attention to the fact
that two sequences sharing nearly 100% identity at the protein level with the claimed sequence are
present in the leading scientific repository for biological sequence data (GenBank), and have been
annotated by third party scientists wholly unaffiliated with Appellants as "Homo sapiens caspr5 protein"
(GenBank accession numbers NM_130773 (Exhibit A) and AB077881 (Exhibit B)). Thus, Appellants'
assertion that the present sequences are caspr proteins is not made on the basis of "structural similarity
of a small domain of the new protein to a small domain of a known protein", but rather vast homology over
the entire sequence. Thus, Bork et al. also does not support the alleged lack of utility for the present
invention.

Thus, while Appellants have provided evidence of record that conclusively establishes that those 
skilled in the art would believe that the specifically claimed sequence encodes a caspr protein, the Examiner 
has provided no evidence that directly establishes that the specifically claimed sequence does not encode 
a caspr protein. Accordingly, the evidence of record compels a finding that the present invention has a 
patentable utility. 

Furthermore, with regard to the citation of journal articles to support an allegation of a lack of utility, 
the PTO has repeatedly attempted to deny the utility of nucleic acid sequences based on a small number 
of publications that call into doubt prediction of protein function from homology information and the 
usefulness of bioinformatic predictions, of which these articles are merely the latest examples. Appellants 
readily agree that there is not 100% consensus within the scientific community regarding prediction of 
protein function from homology information, and further agree that prediction of protein function from 
homology information is not 100% accurate. However, Appellants respectfully point out that the lack of 
100% consensus on prediction of protein function from homology information is completely irrelevant 
to the question of whether the claimed nucleic acid sequence has a substantial and specific utility, and that 
100% accuracy of prediction of protein function from homology information is not the standard for 
patentability under 35 U.S.C. § 101. Appellants respectfully point out that, as discussed above, the legal
test for utility simply involves an assessment of whether those skilled in the art would find any of the utilities 
described for the invention to be believable. Appellants submit that the overwhelming majority of those
of skill in the relevant art would believe prediction of protein function from homology information and the 
usefulness of bioinformatic predictions to be powerful and useful tools, as evidenced by hundreds if not 
thousands of journal articles (which Appellants will submit to the Office if the Board truly doubts 
Appellants' assertion that the overwhelming majority of those of skill in the art place a high value on 
prediction of protein function from homology information and the usefulness of bioinformatic predictions), 
and would thus believe that Appellants' sequence is a caspr protein. As believability is the standard for
meeting the utility requirement of 35 U.S.C. § 101, and not 100% consensus or 100% accuracy, 
Appellants submit that the present claims must clearly meet the requirements of 35 U.S.C. § 101. 

Furthermore, the PTO itself does not require 100% identity between proteins to establish
functional homology. Example 10 of the Revised Interim Utility Guidelines Training Materials, discussed 
above, only requires a similarity score greater than 95% to establish functional homology. Thus, scientific 
publications that generally assert that very small changes between amino acid sequences can lead to 
changes in function, or publications describing specific examples of proteins, distinct from Appellants'
sequence, where a minor change in amino acid sequence has led to a change in function, have been
viewed by the PTO itself as irrelevant to the question of utility, and thus do not support the Examiner's
allegation that the presently claimed sequence lacks utility. Therefore, the present utility rejection must fail 
as a matter of policy, as a matter of science, and as a matter of law. 

However, although Appellants need only make one credible assertion of utility to meet the 
requirements of 35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); In re Gottlieb, 
140 USPQ 665 (CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); Hoffman v. Klaus, 
9 USPQ2d 1657 (Bd. Pat. App. & Inter. 1988)), as yet another example of the utility of the present 
sequence, Appellants pointed out both in the Response to the First Action and in the Response to the Final 
Action that the present nucleic acid sequences have utility in forensic analysis (see, for example, the 
specification at page 10, lines 15-19). As described in the specification at page 15, lines 21-25, the 
present sequences define a coding single nucleotide polymorphism - specifically, a C/T polymorphism at 
position 812 of SEQ ID NO:1, which can lead to a serine or leucine residue at amino acid position 271
of SEQ ID NO:2. As such polymorphisms are the basis for forensic analysis, which is undoubtedly a "real
world" utility, the present sequences must in themselves be useful. 

In the Final Action, the Examiner questioned this asserted utility, stating that "any polynucleotide 
containing a SNP can be used for diagnostic assays" (the Final Action at page 3). The Examiner seems 
to be confusing the requirements of a specific utility with a unique utility. The fact that other polymorphic 
markers have been identified in other genetic loci, or that the use of the presently described polymorphic 
markers will provide additional information concerning the prevalence of these markers in certain 
subpopulations, does not mean that use of the polymorphic markers identified by Appellants in SEQ ID
NO:1 is not a specific utility. As clearly stated by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw
PLC, 20 USPQ2d 1101 (Fed. Cir. 1991):

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir.
1984).

In other words, just because other (possibly better) polymorphic markers from the human genome have 
been described, or that additional information about the presently described polymorphic markers can be 
gained through the use of these markers, does not establish that the presently described polymorphic 
markers lack a specific utility. The requirement for a specific utility, which is part of the standard for utility 
under 35 U.S.C. § 101 presently being applied by the Office, should not be confused with the requirement 
for a unique utility, which is not the legal standard. If every invention were required to have a unique utility, 
the Patent and Trademark Office would no longer be issuing patents on batteries, automobile tires, golf 
balls, golf clubs, and treatments for a variety of human diseases, just to name a few particular examples, 
because other examples of each of these have already been described and patented. However, only the 
briefest perusal of virtually any issue of the Official Gazette provides numerous examples of patents being 
granted on each of the above compositions every week. Furthermore, if each invention needed to have
a unique utility in order to be patented, the entire class and subclass system would be an effort in futility, 
as the class and subclass system serves solely to group such common inventions, which would not be 
required if each invention needed to have a unique utility. In view of the above standards and "common
sense" analysis, there can be little question that the present sequence clearly meets the requirements of 
35 U.S.C. § 101. 

The Final Action states that "without knowing the functions (i.e. utility) of the polynucleotide and
protein of the present invention, one cannot assess a utility for the diagnostic assays using these molecules" 
(the Final Action at page 3). Appellants respectfully submit that the Examiner has completely missed 
Appellants' point with regard to the use of the presently described polymorphism in forensic analysis. In
forensic analysis, polymorphic markers, such as the presently described polymorphism, can be used by 
those skilled in the art to distinguish one person from another based on the presence or absence of the 
described polymorphism. Forensic analysis requires no information regarding the function of the protein 
encoded by the polymorphic DNA sequence. The Examiner has provided no evidence of record that
establishes that skilled artisans would not be able to use the presently described polymorphism in forensic 
analysis exactly as it was described in the specification as originally filed, without any additional research. 
It is important to note that simply because the use of this polymorphic marker will necessarily provide 
additional information on the percentage of particular subpopulations that contain this polymorphic marker 
does not mean that additional research is needed in order for this marker as it is presently described in the 
instant specification to be used in forensic science. Thus, the Examiner has failed to meet his evidentiary 
burden of proving that the present invention lacks utility. 

This is also not a case of a probable utility. Appellants point out that even in the worst case 
scenario, the described polymorphism is useful to distinguish 50% of the population (in other words, the 
marker being present in half of the population). Appellants point out that the ability of a polymorphic 
marker to distinguish at least 50% of the population is an inherent feature of any polymorphic marker, and 
this feature is well understood by those of skill in the art. Appellants note that as a matter of law, it is well 
settled that a patent need not disclose what is well known in the art. In re Wands, supra. Appellants 
respectfully point out that all that is required to support Appellants' assertion of utility is for the skilled artisan
to believe that the presently described polymorphic marker could be useful in forensic analysis. The fact
that forensic biologists use polymorphic markers such as that described by Appellants every day provides
more than ample support for the assertion that forensic biologists would also be able to use the specific
polymorphic marker described by Appellants in the same fashion. Therefore, these allegations are
completely without merit, and in no way establish that the present invention lacks utility. 

Further, as the presently described polymorphisms are part of the family of polymorphisms that
have a well established utility, Appellants' reliance on In re Brana (34 USPQ2d 1436 (Fed. Cir. 1995),
"Brana") is directly on point. In Brana, the Federal Circuit admonished the Patent and Trademark Office
for confusing "the requirements under the law for obtaining a patent with the requirements for obtaining
government approval to market a particular drug for human consumption". Brana at 1442. The Federal
Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office 
examination practice and policy. The question is, with regard to pharmaceutical inventions,
what must the applicant provide regarding the practical utility or usefulness of the invention
for which patent protection is sought. This is not a new issue: it is one which we would
have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation 
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under 
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph.
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C. 
§§ 101 and 112, first paragraph. The Federal Circuit concluded: 

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of 
pharmaceutical inventions, necessarily includes the expectation of further research and 
development. The stage at which an invention in this field becomes useful is well before
it is ready to be administered to humans. Were we to require Phase II testing in order to 
prove utility, the associated costs would prevent many companies from obtaining patent 
protection on promising new inventions, thereby eliminating an incentive to pursue, through 
research and development, potential cures in many crucial areas such as the treatment of 
cancer. 

Brana at 1442-1443, citations omitted, emphasis added. As set forth above, the present polymorphism 
is useful in forensic analysis exactly as it is described in the specification as originally filed, without the need 
for any further research. However, even if, arguendo, further research might be required in certain aspects
of the present invention, this does not preclude a finding that the invention has utility, as set forth by the
Federal Circuit's holding in Brana, which clearly states, as highlighted in the quote above, that
"pharmaceutical inventions, necessarily includes the expectation of further research and development"
(Brana at 1442-1443, emphasis added). In assessing the question of whether undue experimentation
would be required in order to practice the claimed invention, the key term is "undue", not
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (CCPA 1976). The need for some
experimentation does not render the claimed invention unpatentable. Indeed, a considerable amount of
experimentation may be permissible if such experimentation is routinely practiced in the art. In re Angstadt
and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 18 USPQ2d 1016 (Fed. Cir.
1991). As a matter of law, it is well settled that a patent need not disclose what is well known in the art.
In re Wands, supra.

As yet another example of the utility of the present sequence, Appellants pointed out in the 
response to the First Action that those of skill in the art would readily appreciate the importance of tracking 
the expression of the gene encoding the described protein, as described in the specification as originally 
filed, at least at page 5, lines 12-14. In particular, the specification describes how the described sequences 
can be represented using a gene chip format to provide a high throughput analysis of the level of gene 
expression. Such "DNA chips" clearly have utility, as evidenced by hundreds of issued U.S. Patents, as 
exemplified by U.S. Patent Nos. 5,445,934 (Exhibit W), 5,556,752 (Exhibit X), 5,744,305 (Exhibit Y),
5,837,832 (Exhibit Z), 6,156,501 (Exhibit AA) and 6,261,776 (Exhibit BB). Appellants point out that
expression profiling does not require a knowledge of the function of the particular nucleic acid on the chip - 
rather the gene chip indicates which DNA fragments are expressed at greater or lesser levels in two or 
more particular tissue types. 

Evidence of the "real world" substantial utility of the present invention is further provided by the fact
that there is an entire industry established based on the use of gene sequences or fragments thereof in a 
gene chip format. Perhaps the most notable gene chip company is Affymetrix. However, there are many 
companies that have, at one time or another, concentrated on the use of gene sequences or fragments, in 
gene chip and non-gene chip formats, for example: Gene Logic, ABI-Perkin-Elmer, HySeq and Incyte. 
In addition, one such company (Rosetta Inpharmatics) was viewed to have such "real world" value that it 
was acquired by a large pharmaceutical company (Merck) for significant sums of money (net equity value
of the transaction was $620 million). The "real world" substantial industrial utility of gene sequences or 
fragments would, therefore, appear to be widespread and well established. Clearly, there can be no doubt 
that the skilled artisan would know how to use the presently claimed sequences (see Section VIII(B), 
below), strongly arguing that the claimed sequences have utility. Given the widespread utility of such "gene
chip" methods using public domain gene sequence information, there can be little doubt that the use of the
presently described novel sequences would have great utility in such DNA chip applications. As the
present sequences are specific markers of the human genome (see below), and such specific markers are
targets for the discovery of drugs that are associated with human disease, as described above, those of skill
in the art would instantly recognize that the present nucleotide sequences would be ideal, novel candidates
for assessing gene expression using such DNA chips. Clearly, compositions that enhance the utility of such 
DNA chips, such as the presently claimed nucleotide sequences, must in themselves be useful. Thus, the 
present claims clearly meet the requirements of 35 U.S.C. § 101. 

The Examiner also dismisses this assertion of utility, stating that "any nucleotide sequence can be 
used in such an assay" (the Final Action at page 4). Appellants first point out that the present sequence, 
which has been biologically validated to be expressed, has a much greater utility than sequences that are 
merely predicted to be expressed based on bioinformatic analysis. Second, not "any nucleotide sequence" 
can be used to track gene expression, but rather, only that small percentage of nucleotide sequences that
are expressed can be used in such a manner. Third, the Examiner again seems to be confusing the 
requirements of a specific utility with a unique utility. The fact that other nucleotide sequences can be used 
to track gene expression does not mean that the use of Appellants' sequence to track gene expression is 
not a specific utility (Carl Zeiss Stiftung v. Renishaw PLC, supra). Therefore, this argument completely 
fails to support the alleged lack of utility of the presently claimed compositions. 

Clearly, persons of skill in the art, as well as venture capitalists and investors, readily recognize the 
utility, both scientific and commercial, of genomic data in general, and specifically human genomic data. 
Billions of dollars have been invested in the human genome project, resulting in useful genomic data (see, 
e.g., Venter et al., 2001, Science 291:1304; Exhibit CC). The results have been a stunning success as
the utility of human genomic data has been widely recognized as a great gift to humanity (see, e.g., Jasny
and Kennedy, 2001, Science 291:1153; Exhibit DD). Clearly, the usefulness of human genomic data,
such as the presently claimed nucleic acid molecules, is substantial and credible (worthy of billions of dollars
and the creation of numerous companies focused on such information) and well-established (the utility of 
human genomic information has been clearly understood for many years). 

Additionally, as set forth by Appellants in the response to the First Action, and as described in the 
specification as originally filed, at least at page 10, line 18, the present nucleotide sequence has a specific
utility in determining the genomic structure of the corresponding human chromosome, for example mapping
the protein encoding regions. The claimed polynucleotide sequences define how the encoded exons are
actually spliced together to produce an active transcript (i.e., the described sequences are useful for
functionally defining exon splice-junctions). This is evidenced by the fact that SEQ ID NO: 1 can be used 
to map the 24 coding exons on chromosome 2 (present within six overlapping chromosome 2 clones; 
GenBank Accession Numbers AC097715, AC019105, AC019159, AC104648, AC074362 and 079154;
alignments and the first page from the GenBank reports are presented in Exhibit EE). In disclosing
biologically validated exon splice junctions, the claimed sequence provides physical evidence that effectively 
trumps the hypothetical conclusions provided by bioinformatics analysis of the corresponding genomic 
region conducted without supporting physical data. Thus, the claimed sequence clearly meets the
requirements of 35 U.S.C. § 101. 

Appellants respectfully remind the Board that only a minor percentage of the genome (2-4%) 
actually encodes exons, which in turn encode amino acid sequences. The presently claimed polynucleotide
sequence provides biologically validated empirical data (e.g., showing which sequences are transcribed, 
spliced, and polyadenylated) that specifically define that portion of the corresponding genomic locus that 
actually encodes exon sequence. Appellants respectfully submit that the practical scientific value of 
expressed, spliced, and polyadenylated mRNA sequences is readily apparent to those skilled in the relevant 
biological and biochemical arts. The specification details that "sequences derived from regions adjacent 
to the intron/exon boundaries of the human gene can be used to design primers for use in amplification 
assays to detect mutations within the exons, introns, splice sites (e.g., splice acceptor and/or donor sites),
etc., that can be used in diagnostics and pharmacogenomics" (specification at page 10, lines 19-24). Thus, 
the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Thus, as set forth in the Response to the First Action, the present nucleotide sequence has a specific 
utility in mapping the claimed sequence to the corresponding human chromosome, specifically 
chromosome 2, as described above. Clearly, the present polynucleotide provides exquisite specificity in 
localizing the specific region of human chromosome 2 that contains the gene encoding the given 
polynucleotide, a utility not shared by virtually any other nucleic acid sequences. In fact, it is this specificity 
that makes this particular sequence so useful. Early gene mapping techniques relied on methods such as 
Giemsa staining to identify regions of chromosomes. However, such techniques produced genetic maps
with a resolution of only 5 to 10 megabases, far too low to be of much help in identifying specific genes
involved in disease. The skilled artisan readily appreciates the significant benefit afforded by markers that 
map a specific locus of the human genome, such as the present nucleic acid sequence. For further evidence 
in support of the Appellants' position, the Board is requested to review, for example, section 3 of Venter
et al. (supra, at pp. 1317-1321, including Fig. 11 at pp. 1324-1325; see Exhibit CC), which demonstrates
the significance of expressed sequence information in the structural analysis of genomic data. The presently
claimed polynucleotide sequence defines a biologically validated sequence that provides a unique and
specific resource for mapping the genome essentially as described in the Venter et al. article. Thus, the
present claims clearly meet the requirements of 35 U.S.C. § 101. 

The Examiner also questions this asserted utility, stating that "any nucleotide sequence can be used 
in such an assay" (the Final Action at page 4). Appellants first point out that only that small percentage
of nucleotide sequences that are located in this region of chromosome 2 can be used in such a manner. 
Second, the Examiner again seems to be confusing the requirements of a specific utility with a unique 
utility. The fact that a small number of other nucleotide sequences could be used to map the protein coding 
regions in this specific region of chromosome 2 does not mean that the use of Appellants' sequence to map 
the protein coding regions of chromosome 2 is not specific (Carl Zeiss Stiftung v. Renishaw PLC,
supra). 

Importantly, it has been clearly established that a statement of utility in a specification must be
accepted absent reasons why one skilled in the art would have reason to doubt the objective truth of such
statement. In re Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297 (CCPA, 1974; "Langer"); In re
Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA, 1971). As clearly set forth in Langer:

As a matter of Patent Office practice, a specification which contains a disclosure of utility 
which corresponds in scope to the subject matter sought to be patented must be taken as 
sufficient to satisfy the utility requirement of § 101 for the entire claimed subject matter 
unless there is a reason for one skilled in the art to question the objective truth of the 
statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary
skill in the art" (MPEP, Eighth Edition at 2100^0, emphasis added). Thus, the present claims clearly meet
the requirements of 35 U.S.C. § 101.

Regarding the utility requirements under 35 U.S.C. § 101, the Federal Circuit has clearly stated
"(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of providing 
some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 1700 
(Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal Circuit 
has stated that "(t)o violate § 101 the claimed device must be totally incapable of achieving a useful result." 
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed. Cir.
1992), emphasis added. Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); "Cross") 
states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748, 
emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under the sun 
that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group Inc. , 
149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in 
Diamond vs. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S., 1980)). Thus, based on the relevant 
case law, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Finally, while Appellants are well aware of the new Utility Guidelines set forth by the USPTO,
Appellants respectfully point out that the current rules and regulations regarding the examination of patent
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set
forth in 37 C.F.R., not the Manual of Patent Examining Procedure or particular guidelines for patent
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to
interpret these laws and rules. Appellants are unaware of any significant recent changes in either
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit,
that are in keeping with the new Utility Guidelines set forth by the USPTO. This is underscored by numerous
patents that have been issued over the years that claim nucleic acid fragments that do not comply with the 
new Utility Guidelines. As examples of such issued U.S. Patents, the Board is invited to review U.S. Patent 
Nos. 5,817,479 (Exhibit FF), 5,654,173 (Exhibit GG), and 5,552,281 (Exhibit HH; each of which 
claims short polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit II; which includes 
no working examples), none of which contain examples of the "real-world" utilities that the Examiner seems
to be requiring. As issued U.S. Patents are presumed to meet all of the requirements for patentability, 
including 35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below), Appellants submit that
the present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Appellants agree 
that each application is examined on its own merits, Appellants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit,
since the issuance of these patents that render the subject matter claimed in these patents, which is similar
to the subject matter in question in the present application, suddenly non-statutory or unable to meet the
requirements of 35 U.S.C. § 101. Thus, holding Appellants to a different standard of utility would be 
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1-3 and 6-8 under
35 U.S.C. § 101 must be overruled. 

B. Are Claims 1-3 and 6-8 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-3 under 35 U.S.C. § 112, first paragraph, since allegedly
one skilled in the art would not know how to use the invention, as the invention allegedly is not supported 
by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have 
determined that the utility requirement of Section 101 and the how to use requirement of Section 112, first 
paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; In re 
Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 439 F.2d
1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-3 and 6-8 have 
been shown to have "a specific, substantial, and credible utility", as detailed in Section VIII(A) above, the 
present rejection of claims 1-3 and 6-8 under 35 U.S.C. § 112, first paragraph, cannot stand. 

Appellants therefore submit that the rejection of claims 1-3 and 6-8 under 35 U.S.C. § 112, first 
paragraph, must be overruled.

IX. APPENDIX

The claims involved in this appeal are as follows: 

1. (Amended) An isolated nucleic acid molecule comprising the nucleotide sequence of SEQ ID NO:1.

2. (Amended) An isolated nucleic acid molecule comprising a nucleotide sequence that: 

(a) encodes the amino acid sequence of SEQ ID NO:2; and 

(b) hybridizes to the complement of the nucleotide sequence of SEQ ID NO:1 under
highly stringent conditions of 0.5 M NaHPO4, 7% sodium dodecyl sulfate (SDS)
and 1 mM EDTA at 65°C and washing in 0.1x SSC/0.1% SDS at 68°C.

3. An isolated nucleic acid molecule comprising a nucleotide sequence that encodes the amino
acid sequence shown in SEQ ID NO:2. 

6. A recombinant expression vector comprising the isolated nucleic acid molecule of claim 3. 

7. The recombinant expression vector of claim 6, wherein the nucleic acid molecule comprises
the nucleotide sequence of SEQ ID NO:1.

8. A host cell comprising the recombinant expression vector of claim 6.

X. CONCLUSION

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's conclusions
that claims 1-3 and 6-8 lack a patentable utility and are unusable by the skilled artisan due to a lack of
patentable utility are unwarranted. It is therefore requested that the Board overturn the Final Action's
rejections.

STATUTES

35 U.S.C. § 101    2-4, 6, 10-12, 14, 16-20
35 U.S.C. § 102    2, 3
35 U.S.C. § 112    2-4, 6, 14, 20



>NM_130773 ACCESSION: NM_130773 NID: gi 20544138 ref NM_130773.2 Homo 

sapiens caspr5 protein (caspr5), transcript variant 1,
mRNA 
Length = 5284 

Score = 2567 bits (6581), Expect = 0.0 

Identities = 1303/1307 (99%), Positives = 1303/1307 (99%), Gaps = 3/1307 (0%)
Frame = +2 

Query: 1 MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW 60

MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW 
Sbjct: 365 MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW 544 

Query: 61 RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY 120

RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY
Sbjct: 545 RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY 724

Query: 121 KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV 180

KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV
Sbjct: 725 KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV 904

Query: 181 ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL 240 

ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL 
Sbjct: 905 ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL 1084

Query: 241 HLNLGDSKARLSSSLPSATLGSLLDDQHWH-VLIERVGKQVNFTVDKHTQHFRTKGETDA 299

HLNLGDSKARLSSSLPSATLGSLLDDQHWH VLIERVGKQVNFTVDKHTQHFRTKGETDA
Sbjct: 1085 HLNLGDSKARLSSSLPSATLGSLLDDQHWHSVLIERVGKQVNFTVDKHTQHFRTKGETDA 1264

Query: 300 LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII-LAKRRKHQIYTVGNVTFS 358

LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII LAKRRKHQIYT GNVTFS
Sbjct: 1265 LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNIIDLAKRRKHQIYT-GNVTFS 1441

Query: 359 CSEPQIVPITF-NSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS 417 

CSEPQIVPITF NSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS 
Sbjct: 1442 CSEPQIVPITFVNSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS 1621 

Query: 418 LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW 477 

LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW 
Sbjct: 1622 LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW 1801 

Query: 478 VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL 537 

VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL 
Sbjct: 1802 VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL 1981 

Query: 538 HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN 597 

HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN 
Sbjct: 1982 HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN 2161 

Query: 598 TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG 657

TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG
Sbjct: 2162 TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG 2341

Query: 658 SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ 717 

SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ 
Sbjct: 2342 SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ 2521 



Query: 718 CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG 777

CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG
Sbjct: 2522 CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG 2701

Query: 778 PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR 837 

PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR
Sbjct: 2702 PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR 2881

Query: 838 LEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE 897

LEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE
Sbjct: 2882 LEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE 3061

Query: 898 TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH 957 

TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH 
Sbjct: 3062 TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH 3241 

Query: 958 CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT 1017 

CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT 
Sbjct: 3242 CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT 3421 

Query: 1018 KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH 1077 

KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH
Sbjct: 3422 KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH 3601

Query: 1078 LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT 1137 

LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT
Sbjct: 3602 LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT 3781

Query: 1138 LGKVTENLGLDSEVAKANAMGFAGC^^ 1197 

LGKVTENLGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTES 
Sbjct: 3782 LGKVTENLGLDSEVAKANAMGFAGCMSSVQYNHIA 3961 

Query: 1198 GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT 1257

GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT
Sbjct: 3962 GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT 4141

Query: 1258 RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI 1304 

RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI 
Sbjct: 4142 RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI 4282

LOCUS       caspr5                  5284 bp    mRNA    linear   PRI 05-NOV-2002
DEFINITION  Homo sapiens caspr5 protein (caspr5), transcript variant 1, mRNA.
ACCESSION   NM_130773
VERSION     NM_130773.2  GI:20544138
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1
  AUTHORS   Takeuchi,K., Watanabe,N., Kawano,T. and Kawamura,K.
  TITLE     In vitro and in vivo studies on the involvement of neural cell
            adhesion molecules and chondroitin sulfate proteoglycans, in
            defining discrete axonal pathways of the rat cerebral cortex
  JOURNAL   Unpublished
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AB077881.1 and AK056528.1.
            On May 13, 2002 this sequence version replaced gi:18640733.
            Summary: This gene product belongs to the neurexin family, members
            of which function in the vertebrate nervous system as cell adhesion
            molecules and receptors. This protein, like other neurexin
            proteins, contains epidermal growth factor repeats and laminin G
            domains. In addition, it includes an F5/8 type C domain,
            discoidin/neuropilin- and fibrinogen-like domains, and
            thrombospondin N-terminal-like domains. Alternative splicing of
            this gene results in 2 transcript variants encoding different
            isoforms.
            Transcript Variant: This variant (1) encodes the longer isoform
            (1).
FEATURES             Location/Qualifiers
     source          1..5284
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="2"
                     /map="2q14.1"
     gene            1..5284
                     /gene="caspr5"
                     /note="synonym: FLJ31966"
                     /db_xref="LocusID:129684"
     CDS             365..4285
                     /gene="caspr5"
                     /codon_start=1
                     /product="caspr5 protein isoform 1"
                     /protein_id="NP_570129.1"
                     /db_xref="GI:18640734"
                     /db_xref="LocusID:129684"
                     /translation="MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFS
                     SSSDLTGTHSPAQLNWRVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDW
                     VTSYSLMFSDTGRNWKQYKQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWN
                     PSGKIGMRVEVYGCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVL
                     FHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQHWHSVLIER
                     VGKQVNFTVDKHTQHFRTKGETDALDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLY
                     YNGVNIIDLAKRRKHQIYTGNVTFSCSEPQIVPITFVNSSGSYLLLPGTPQIDGLSVS
                     FQFRTWNKDGLLLSTELSEGSGTLLLSLEGGILRLVIQKMTERVAEILTGSNLNDGLW
                     HSVSINARRNRITLTLDDEAAPPAPDSTWVQIYSGNSYYFGGCPDNLTDSQCLNPIKA
                     FQGCMRLIFIDNQPKDLISVQQGSLGNFSDLHIDLCSIKDRCLPNYCEHGGSCSQSWT
                     TFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGNTAGFFYIDSDGSGPLGPLQVYCNIT
                     EDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGGSMEQLEAVIDGSEHCEQEVAYHC
                     RRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQCECGLDESCLDIQHFCNCDAD
                     KDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIGPLRCYGDRRFWNAVSFYTE
                     ASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIRLEISSPSEITFAIDVGN
                     GPVELWQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRETSEEGHFRLQLNSQL
                     FVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGHCSSYGSICHNGGK
                     CVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVTKNISLSSSAIY
                     TDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYHLNKEETHVF
                     TIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLTLGKVTEN
                     LGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHAWAPVTVHGTLTESSCGFMVD
                     SDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMTRFL
                     YQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI"
     misc_feature    530..877
                     /gene="caspr5"
                     /note="F5_F8_type_C; Region: F5/8 type C domain. This
                     domain is also known as the discoidin (DS) domain family.
                     The bacterial examples are not yet included in the SEED
                     alignment and are only found with low scores"
                     /db_xref="CDD:pfam00754"
     misc_feature    551..886
                     /gene="caspr5"
                     /note="FA58C; Region: Coagulation factor 5/8 C-terminal
                     domain, discoidin domain"
                     /db_xref="CDD:smart00231"
     misc_feature    977..1378
                     /gene="caspr5"
                     /note="LamG; Region: Laminin G domain"
                     /db_xref="CDD:smart00282"
     misc_feature    989..1384
                     /gene="caspr5"
                     /note="laminin_G; Region: Laminin G domain"
                     /db_xref="CDD:pfam00054"
     misc_feature    1529..1927
                     /gene="caspr5"
                     /note="LamG; Region: Laminin G domain"
                     /db_xref="CDD:smart00282"
     misc_feature    1688..1936
                     /gene="caspr5"
                     /note="laminin_G; Region: Laminin G domain"
                     /db_xref="CDD:pfam00054"
     misc_feature    2798..3178
                     /gene="caspr5"
                     /note="LamG; Region: Laminin G domain"
                     /db_xref="CDD:smart00282"
     misc_feature    2819..3187
                     /gene="caspr5"
                     /note="laminin_G; Region: Laminin G domain"
                     /db_xref="CDD:pfam00054"
     misc_feature    3263..3346
                     /gene="caspr5"
                     /note="EGF; Region: EGF-like domain. There is no clear
                     separation between noise and signal. pfam00053 is very
                     similar, but has 8 instead of 6 conserved cysteines.
                     Includes some cytokine receptors. The family is difficult
                     to model due to many similar but different sub-types of
                     EGF domains"
                     /db_xref="CDD:pfam00008"
     misc_feature    3470..3880
                     /gene="caspr5"
                     /note="LamG; Region: Laminin G domain"
                     /db_xref="CDD:smart00282"
     misc_feature    3503..3880
                     /gene="caspr5"
                     /note="laminin_G; Region: Laminin G domain"
                     /db_xref="CDD:pfam00054"
     polyA_signal    5271..5276
                     /gene="caspr5"
BASE COUNT     1364 a   1314 c   1305 g   1301 t
ORIGIN 

1 gattccagct ctcgcgcccg acgaggtgga tttggctgtc caccgagctc cggcgcctgt 
61 cgttctaatt gggtttggat ttgcaccgtt aaggaggggg gaagagaagg aagaggcggg 
121 cgaggaaggc gagtccagct agcggctgtt gcggggaccg tagccccagc tgcagctccg 
181 aagaatcccc cgccacggtt tcggtggagc gtctgggcac gggatggagt gaaagagcga 
241 gtgcctctcc aagcgggggt gggagggggt caggctgtgc agaggagaga gacagcgaga
301 agaagccgcg gctggctact gcgaatttgg gattcgattg ggagggaccg ctcactcggg 
361 ggaaatggat tctttaccac ggctgaccag cgttttgact ttgctgttct ctggcttgtg 
421 gcatttagga ttaacagcga caaactacaa ctgtgatgat ccactagcat ccctgctctc 
481 tccaatggct ttttccagtt cctcagacct cactggcact cacagcccag ctcaactcaa 
541 ctggagagtt ggaactggcg gttggtcccc agcagattcc aatgctcaac agtggctcca 
601 gatggacctg ggaaacagag tagagattac agcagtggcc acgcagggaa gatacggaag 
661 ctctgactgg gtgacgagtt acagcctgat gttcagtgac acaggacgca actggaaaca 
721 gtacaaacaa gaagacagca tctggacctt tgcaggaaac atgaatgctg acagcgtggt 
781 gcaccacaag ctattgcact cagtgagagc ccgatttgtt cgctttgtgc ccctggaatg 
841 gaatcccagt gggaagattg gcatgagagt cgaggtctac ggatgttcct ataaatcaga 
901 tgttgctgac tttgatggcc gaagctcact tctgtacagg ttcaatcaga agttgatgag 
961 tactctcaaa gatgtgatct ccctgaagtt caagagcatg caaggagatg gggtcctgtt 
1021 ccatggagaa ggtcagcgtg gagaccacat caccttggaa ctccagaagg ggaggctcgc 
1081 cctacacctc aatttgggtg acagcaaagc gcggctcagc agcagcttgc cctctgccac 
1141 cctgggcagc ctcctggatg accagcactg gcactcggtc ctcattgagc gggtgggcaa 
1201 gcaggtgaac ttcacggtgg acaagcacac acagcacttc cgcaccaagg gcgagacgga 
1261 tgccttagac attgactatg agcttagttt tggaggaatt ccagtaccag gaaaacctgg 
1321 gaccttttta aagaaaaact tccatggatg catcgaaaac ctttactaca atggagtaaa 
1381 cataattgac ctggctaaga gacgaaagca tcagatctat actggcaatg tcactttttc 
1441 ctgctccgaa ccacagattg tgcccatcac atttgtcaac tccagcggca gctatttgct 
1501 gctgcccggc accccccaaa ttgatgggct ctcagtgagt ttccagtttc gaacatggaa
1561 caaggatggt ctgcttctgt ccacagagct gtctgagggc tcgggaaccc tgctgctgag
1621 cctggagggt ggaatcctga gactcgtgat tcagaaaatg acagaacgcg tagctgaaat
1681 cctcacaggc agcaacttga atgatggcct gtggcactcg gttagcatca acgccaggag 
1741 gaaccgcatc acgctcactc tggatgatga agcagcaccc ccggctccag acagcacttg 
1801 ggtgcagatt tattctggaa atagctacta ttttggaggg tgccccgaca atctcaccga 
1861 ttcccaatgt ttaaatccca ttaaggcttt ccaaggctgc atgaggctca tctttattga 
1921 taaccagccc aaggacctca tttcagttca gcaaggttcc ctggggaatt ttagtgattt 
1981 acacattgat ctgtgtagca tcaaagacag gtgtttgcca aactactgtg aacatggagg 
2041 aagctgctcc cagtcctgga ctaccttcta ttgtaactgc agtgacacaa gttacactgg 
2101 tgccacctgc cacaactcca tctacgagca atcctgcgag gtgtacaggc accaggggaa 
2161 tacagccggc ttcttctaca tcgactcaga tggcagcggc ccactgggac ctctccaggt 
2221 gtactgcaat atcactgagg acaagatctg gacatcagtg cagcacaaca atacagagct 
2281 gacccgagtg cggggcgcta accctgagaa gccctatgcc atggccttgg actacggggg 
2341 cagcatggaa cagctggagg ccgtgatcga cggctctgag cactgtgagc aggaggtggc 
2401 ctaccactgc aggaggtccc gcctgctcaa cacgccggat ggaacaccat ttacctggtg 
2461 gattgggcgg tccaatgaaa ggcaccctta ctggggaggt tcccctcctg gggtccagca
2521 gtgtgagtgt ggcctagacg agagctgcct ggacattcag cacttttgca attgcgacgc
2581 tgacaaggat gaatggacaa atgatactgg ctttctttcc ttcaaagacc acttgcctgt
2641 cactcagata gttatcactg ataccgacag atcaaactca gaagccgctt ggagaattgg
2701 tcccttgcgt tgctatggtg accgacgctt ctggaacgcc gtctcatttt atacagaagc
2761 ctcttacctc cactttccta ccttccatgc ggaattcagt gccgatattt ccttcttttt
2821 taaaaccaca gcattatccg gagttttcct agaaaatctt ggcattaaag acttcattcg
2881 actcgaaata agctctcctt cagagatcac ctttgccatc gatgttggga atggtcctgt
2941 ggagcttgta gtccagtctc cttctcttct gaatgacaac caatggcact atgtccgggc
3001 tgagaggaac ctcaaggaga cctccctgca ggtggacaac cttccaagga gcaccaggga
3061 gacgtcggag gagggccatt ttcgactgca gctgaacagc cagttgtttg tagggggaac
3121 gtcatccaga cagaaaggct tcctaggatg cattcgctcc ttacacttga atggacagaa
3181 aatggacctg gaagagaggg caaaggtcac atctggagtc aggccaggct gccccggcca
3241 ctgcagcagc tacggcagca tctgccacaa cgggggcaag tgtgtggaga agcacaatgg
3301 ctacctgtgt gattgcacca attcacctta tgaagggccc ttttgcaaaa aagaggtttc
3361 tgctgttttt gaggctggca cgtcggttac ttacatgttt caagaaccct atcctgtgac
3421 caagaatata agcctctcat cctcagctat ttacacagat tcagctccat ccaaggaaaa
3481 cattgcactt agctttgtga caacccaggc acccagtctt ttgctcttta tcaattcttc
3541 ttctcaggac ttcgtggttg ttctgctctg caagaatgga agcttacagg ttcgctatca
3601 cctaaacaag gaagaaaccc atgtattcac cattgatgca gataactttg ctaacagaag
3661 gatgcaccac ttgaagatta accgagaggg aagagagctt accattcaga tggaccagca
3721 acttcgactc agttataact tctctccgga agtagagttc agggttataa ggtcactcac
3781 cttgggcaaa gtcacagaga atcttggttt ggattctgaa gttgctaaag caaatgccat
3841 gggttttgct ggatgcatgt cttccgtcca gtacaaccac atagcaccac tgaaggctgc
3901 cctgcgccat gccactgtcg cgcctgtgac tgtccatggg accttgacgg aatccagctg
3961 tggcttcatg gtggactcag atgtgaatgc agtgaccacg gtgcattctt catcagatcc
4021 ttttgggaag acagatgagc gggaaccact cacaaatgct gttcgaagtg attcggcagt
4081 catcggaggg gtgatagcag tggtgatatt catcatcttc tgtatcatcg gcatcatgac
4141 ccggttcctc taccagcaca agcagtcaca tcgtacgagc cagatgaagg agaaggaata
4201 tccagaaaat ttggacagtt ccttcagaaa tgaaattgac ttgcaaaaca cagtgagcga
4261 gtgtaaacgg gaatatttca tctgagaaac tgcagggttc ctactactct tttttcttgt
4321 tgttcaatta tctcctcccc ctcttctctc ctgtcttttg atttggtcat tctctttatt
4381 ttctgcttgc catgtctttt ctggaacata cttgcatcca ccacagcatc aattcccttg
4441 atccagccca agagaccagg cagccatggc cactgccttc ctctctgatg aacctatcgg
4501 gtgaaaacga ccactcaaga gactgacttc gccattcaag acaaggaaga gacacatgtg
4561 tgcactcctg catgttcagt tctgtacttc cagtttctaa aatgcactgt tcagttttcc
4621 aaccacttgg tggttcaggc ttgctttgaa cctgagctct taggcacatg acggtcattc
4681 ctgacatcct ccccagctca agtctattct taccatagaa cccagggcag ggagagaaga
4741 acctagaggc ctggtttgct ttggtggcat tgtaaaaaga gtaagagagg tttggtttgt
4801 ggtggtttgc tttctttacc ataagcaatc ccttgcctta actcatcacc ctttttcact
4861 atgaccctta gaccctgagt attttcaaat atatgattgc tgatagtagt gaccaaaact
4921 actttgttcc tttcttacca ctctctcctg gggccgacac gttgggacag cacaccatag
4981 cataaagcta ggggatgcat ggaaatagca gcttgaaact aggaggtaac aagaaagctt
5041 ctaggaagta gatgttccat atcttcaaaa tgcctcctcc aattttgtaa gaatgctagc
5101 taggtattcc tgggattatt atactgagat atatatatat acacacacac acacatatgt
5161 gtatatatgt atatatatat gtgagtatat atacacacac acacacacac acacatatat
5221 atatatacac acacgcacac atatatgttg ctgcagcata aagaaattga aataaaagtt
5281 taaa
>AB077881 ACCESSION: AB077881 NID: gi 18181975 dbj AB077881.1 Homo
sapiens mRNA for caspr5, complete cds
Length = 4920

Score = 2567 bits (6581), Expect = 0.0
Identities = 1303/1307 (99%), Positives = 1303/1307 (99%), Gaps = 3/1307 (0%)
Frame = +1

Query: 1    MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW 60
            MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW
Sbjct: 1    MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSPAQLNW 180

Query: 61   RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY 120
            RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY
Sbjct: 181  RVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRNWKQY 360

Query: 121  KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV 180
            KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV
Sbjct: 361  KQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDV 540

Query: 181  ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL 240
            ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL
Sbjct: 541  ADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKGRLAL 720

Query: 241  HLNLGDSKARLSSSLPSATLGSLLDDQHWH-VLIERVGKQVNFTVDKHTQHFRTKGETDA 299
            HLNLGDSKARLSSSLPSATLGSLLDDQHWH VLIERVGKQVNFTVDKHTQHFRTKGETDA
Sbjct: 721  HLNLGDSKARLSSSLPSATLGSLLDDQHWHSVLIERVGKQVNFTVDKHTQHFRTKGETDA 900

Query: 300  LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII-LAKRRKHQIYTVGNVTFS 358
            LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII LAKRRKHQIYT GNVTFS
Sbjct: 901  LDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNIIDLAKRRKHQIYT-GNVTFS 1077

Query: 359  CSEPQIVPITF-NSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS 417
            CSEPQIVPITF NSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS
Sbjct: 1078 CSEPQIVPITFVNSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTLLLS 1257

Query: 418  LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW 477
            LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW
Sbjct: 1258 LEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPDSTW 1437

Query: 478  VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL 537
            VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL
Sbjct: 1438 VQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNFSDL 1617

Query: 538  HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN 597
            HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN
Sbjct: 1618 HIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGN 1797

Query: 598  TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG 657
            TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG
Sbjct: 1798 TAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGG 1977

Query: 658  SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ 717
            SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ
Sbjct: 1978 SMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQ 2157

Query: 718  CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG 777
            CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG
Sbjct: 2158 CECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIG 2337

Query: 778  PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR 837
            PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR
Sbjct: 2338 PLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIR 2517

Query: 838  LEISSPSEITFAIDVGNGPVELVVQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE 897
            LEISSPSEITFAIDVGNGPVELVVQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE
Sbjct: 2518 LEISSPSEITFAIDVGNGPVELVVQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRE 2697

Query: 898  TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH 957
            TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH
Sbjct: 2698 TSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGH 2877

Query: 958  CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT 1017
            CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT
Sbjct: 2878 CSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVT 3057

Query: 1018 KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH 1077
            KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH
Sbjct: 3058 KNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYH 3237

Query: 1078 LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT 1137
            LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT
Sbjct: 3238 LNKEETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLT 3417

Query: 1138 LGKVTENLGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTESSC 1197
            LGKVTENLGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTESSC
Sbjct: 3418 LGKVTENLGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTESSC 3597

Query: 1198 GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT 1257
            GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT
Sbjct: 3598 GFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMT 3777

Query: 1258 RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI 1304
            RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI
Sbjct: 3778 RFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI 3918
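
The identity and gap figures reported in the alignment header can be spot-checked on any single block. The following is a minimal sketch, not part of the original exhibit: the function name `alignment_identity` is hypothetical, and the two strings are one 60-column block of the alignment (query residues 241-299) with the query gap written as `-`.

```python
def alignment_identity(query, subject):
    """Return (identical columns, gap columns, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    # A column is identical when both residues match and neither is a gap.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column is a gap column when either sequence has '-' at that position.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identical, gaps, len(query)


# One block of the alignment (query residues 241-299); illustrative input only.
query = "HLNLGDSKARLSSSLPSATLGSLLDDQHWH-VLIERVGKQVNFTVDKHTQHFRTKGETDA"
subject = "HLNLGDSKARLSSSLPSATLGSLLDDQHWHSVLIERVGKQVNFTVDKHTQHFRTKGETDA"

identical, gaps, length = alignment_identity(query, subject)
print(f"Identities = {identical}/{length}, Gaps = {gaps}/{length}")
```

Summing the three counts over all blocks would give figures comparable to the "Identities" and "Gaps" values in the header, assuming the blocks are transcribed without error.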
1: AB077881. Homo sapiens mRNA...[gi:18181975]

LOCUS       AB077881 4920 bp mRNA linear PRI 17-JAN-2002
DEFINITION  Homo sapiens mRNA for caspr5, complete cds.
ACCESSION   AB077881
VERSION     AB077881.1 GI:18181975
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1
  AUTHORS   Takeuchi,K., Watanabe,N., Kawano,T. and Kawamura,K.
  TITLE     In vitro and in vivo studies on the involvement of neural cell
            adhesion molecules and chondroitin sulfate proteoglycans in
            defining discrete axonal pathways of the rat cerebral cortex
  JOURNAL   Unpublished
REFERENCE   2 (bases 1 to 4920)
  AUTHORS   Takeuchi,K.
  TITLE     Direct Submission
  JOURNAL   Submitted (12-JAN-2002) Kosei Takeuchi, Nagoya University, Dept. of
            Biological Sciences; Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8602,
            Japan (E-mail:ktakeuch@bioll.bio.nagoya-u.ac.jp,
            Tel:81-52-789-2496, Fax:81-052-789-2968)
FEATURES             Location/Qualifiers
     source          1..4920
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /tissue_type="brain"
     gene            1..4920
                     /gene="caspr5"
     CDS             1..3921
                     /gene="caspr5"
                     /codon_start=1
                     /product="caspr5"
                     /protein_id="BAB83897.1"
                     /db_xref="GI:18181976"
                     /translation="MDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFS
                     SSSDLTGTHSPAQLNWRVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDW
                     VTSYSLMFSDTGRNWKQYKQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWN
                     PSGKIGMRVEVYGCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVL
                     FHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQHWHSVLIER
                     VGKQVNFTVDKHTQHFRTKGETDALDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLY
                     YNGVNIIDLAKRRKHQIYTGNVTFSCSEPQIVPITFVNSSGSYLLLPGTPQIDGLSVS
                     FQFRTWNKDGLLLSTELSEGSGTLLLSLEGGILRLVIQKMTERVAEILTGSNLNDGLW
                     HSVSINARRNRITLTLDDEAAPPAPDSTWVQIYSGNSYYFGGCPDNLTDSQCLNPIKA
                     FQGCMRLIFIDNQPKDLISVQQGSLGNFSDLHIDLCSIKDRCLPNYCEHGGSCSQSWT
                     TFYCNCSDTSYTGATCHNSIYEQSCEVYRHQGNTAGFFYIDSDGSGPLGPLQVYCNIT
                     EDKIWTSVQHNNTELTRVRGANPEKPYAMALDYGGSMEQLEAVIDGSEHCEQEVAYHC
                     RRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPGVQQCECGLDESCLDIQHFCNCDAD
                     KDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAWRIGPLRCYGDRRFWNAVSFYTE
                     ASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKDFIRLEISSPSEITFAIDVGN
                     GPVELVVQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRSTRETSEEGHFRLQLNSQL
                     FVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGCPGHCSSYGSICHNGGK
                     CVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVTKNISLSSSAIY
                     TDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVVVLLCKNGSLQVRYHLNKEETHVF
                     TIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLTLGKVTEN
                     LGLDSEVAKANAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTESSCGFMVD
                     SDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAVVIFIIFCIIGIMTRFL
                     YQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSECKREYFI"
BASE COUNT 1292 a 1229 c 1160 g 1239 t 

ORIGIN 

1 atggattctt taccacggct gaccagcgtt ttgactttgc tgttctctgg cttgtggcat 
61 ttaggattaa cagcgacaaa ctacaactgt gatgatccac tagcatccct gctctctcca 
121 atggcttttt ccagttcctc agacctcact ggcactcaca gcccagctca actcaactgg 
181 agagttggaa ctggcggttg gtccccagca gattccaatg ctcaacagtg gctccagatg 
241 gacctgggaa acagagtaga gattacagca gtggccacgc agggaagata cggaagctct 
301 gactgggtga cgagttacag cctgatgttc agtgacacag gacgcaactg gaaacagtac
361 aaacaagaag acagcatctg gacctttgca ggaaacatga atgctgacag cgtggtgcac 
421 cacaagctat tgcactcagt gagagcccga tttgttcgct ttgtgcccct ggaatggaat 
481 cccagtggga agattggcat gagagtcgag gtctacggat gttcctataa atcagatgtt 
541 gctgactttg atggccgaag ctcacttctg tacaggttca atcagaagtt gatgagtact 
601 ctcaaagatg tgatctccct gaagttcaag agcatgcaag gagatggggt cctgttccat 
661 ggagaaggtc agcgtggaga ccacatcacc ttggaactcc agaaggggag gctcgcccta
721 cacctcaatt tgggtgacag caaagcgcgg ctcagcagca gcttgccctc tgccaccctg 
781 ggcagcctcc tggatgacca gcactggcac tcggtcctca ttgagcgggt gggcaagcag 
841 gtgaacttca cggtggacaa gcacacacag cacttccgca ccaagggcga gacggatgcc 
901 ttagacattg actatgagct tagttttgga ggaattccag taccaggaaa acctgggacc 
961 tttttaaaga aaaacttcca tggatgcatc gaaaaccttt actacaatgg agtaaacata 
1021 attgacctgg ctaagagacg aaagcatcag atctatactg gcaatgtcac tttttcctgc
1081 tccgaaccac agattgtgcc catcacattt gtcaactcca gcggcagcta tttgctgctg 
1141 cccggcaccc cccaaattga tgggctctca gtgagtttcc agtttcgaac atggaacaag 
1201 gatggtctgc ttctgtccac agagctgtct gagggctcgg gaaccctgct gctgagcctg 
1261 gagggtggaa tcctgagact cgtgattcag aaaatgacag aacgcgtagc tgaaatcctc 
1321 acaggcagca acttgaatga tggcctgtgg cactcggtta gcatcaacgc caggaggaac 
1381 cgcatcacgc tcactctgga tgatgaagca gcacccccgg ctccagacag cacttgggtg 
1441 cagatttatt ctggaaatag ctactatttt ggagggtgcc ccgacaatct caccgattcc 
1501 caatgtttaa atcccattaa ggctttccaa ggctgcatga ggctcatctt tattgataac 
1561 cagcccaagg acctcatttc agttcagcaa ggttccctgg ggaattttag tgatttacac 
1621 attgatctgt gtagcatcaa agacaggtgt ttgccaaact actgtgaaca tggaggaagc 
1681 tgctcccagt cctggactac cttctattgt aactgcagtg acacaagtta cactggtgcc 
1741 acctgccaca actccatcta cgagcaatcc tgcgaggtgt acaggcacca ggggaataca 
1801 gccggcttct tctacatcga ctcagatggc agcggcccac tgggacctct ccaggtgtac 
1861 tgcaatatca ctgaggacaa gatctggaca tcagtgcagc acaacaatac agagctgacc 
1921 cgagtgcggg gcgctaaccc tgagaagccc tatgccatgg ccttggacta cgggggcagc
1981 atggaacagc tggaggccgt gatcgacggc tctgagcact gtgagcagga ggtggcctac 
2041 cactgcagga ggtcccgcct gctcaacacg ccggatggaa caccatttac ctggtggatt 
2101 gggcggtcca atgaaaggca cccttactgg ggaggttccc ctcctggggt ccagcagtgt 
2161 gagtgtggcc tagacgagag ctgcctggac attcagcact tttgcaattg cgacgctgac 
2221 aaggatgaat ggacaaatga tactggcttt ctttccttca aagaccactt gcctgtcact 
2281 cagatagtta tcactgatac cgacagatca aactcagaag ccgcttggag aattggtccc 
2341 ttgcgttgct atggtgaccg acgcttctgg aacgccgtct cattttatac agaagcctct 
2401 tacctccact ttcctacctt ccatgcggaa ttcagtgccg atatttcctt cttttttaaa 
2461 accacagcat tatccggagt tttcctagaa aatcttggca ttaaagactt cattcgactc 
2521 gaaataagct ctccttcaga gatcaccttt gccatcgatg ttgggaatgg tcctgtggag 
2581 cttgtagtcc agtctccttc tcttctgaat gacaaccaat ggcactatgt ccgggctgag 
2641 aggaacctca aggagacctc cctgcaggtg gacaaccttc caaggagcac cagggagacg 
2701 tcggaggagg gccattttcg actgcagctg aacagccagt tgtttgtagg gggaacgtca 
2761 tccagacaga aaggcttcct aggatgcatt cgctccttac acttgaatgg acagaaaatg 
2821 gacctggaag agagggcaaa ggtcacatct ggagtcaggc caggctgccc cggccactgc
2881 agcagctacg gcagcatctg ccacaacggg ggcaagtgtg tggagaagca caatggctac 
2941 ctgtgtgatt gcaccaattc accttatgaa gggccctttt gcaaaaaaga ggtttctgct 
3001 gtttttgagg ctggcacgtc ggttacttac atgtttcaag aaccctatcc tgtgaccaag 
3061 aatataagcc tctcatcctc agctatttac acagattcag ctccatccaa ggaaaacatt 
3121 gcacttagct ttgtgacaac ccaggcaccc agtcttttgc tctttatcaa ttcttcttct 
3181 caggacttcg tggttgttct gctctgcaag aatggaagct tacaggttcg ctatcaccta 
3241 aacaaggaag aaacccatgt attcaccatt gatgcagata actttgctaa cagaaggatg 
3301 caccacttga agattaaccg agagggaaga gagcttacca ttcagatgga ccagcaactt 
3361 cgactcagtt ataacttctc tccggaagta gagttcaggg ttataaggtc actcaccttg 
3421 ggcaaagtca cagagaatct tggtttggat tctgaagttg ctaaagcaaa tgccatgggt 
3481 tttgctggat gcatgtcttc cgtccagtac aaccacatag caccactgaa ggctgccctg 
3541 cgccatgcca ctgtcgcgcc tgtgactgtc catgggacct tgacggaatc cagctgtggc 
3601 ttcatggtgg actcagatgt gaatgcagtg accacggtgc attcttcatc agatcctttt 
3661 gggaagacag atgagcggga accactcaca aatgctgttc gaagtgattc ggcagtcatc 
3721 ggaggggtga tagcagtggt gatattcatc atcttctgta tcatcggcat catgacccgg 
3781 ttcctctacc agcacaagca gtcacatcgt acgagccaga tgaaggagaa ggaatatcca 
3841 gaaaatttgg acagttcctt cagaaatgaa attgacttgc aaaacacagt gagcgagtgt 
3901 aaacgggaat atttcatctg agaaactgca gggttcctac tactcttttt tcttgttgtt 
3961 caattatctc ctccccctct tctctcctgt cttttgattt ggtcattctc tttattttct 
4021 gcttgccatg tcttttctgg aacatacttg catccaccac agcatcaatt cccttgatcc 
4081 agcccaagag accaggcagc catggccact gccttcctct ctgatgaacc tatcgggtga 
4141 aaacgaccac tcaagagact gacttcgcca ttcaagacaa ggaagagaca catgtgtgca 
4201 ctcctgcatg ttcagttctg tacttccagt ttctaaaatg cactgttcag ttttccaacc 
4261 acttggtggt tcaggcttgc tttgaacctg agctcttagg cacatgacgg tcattcctga
4321 catcctcccc agctcaagtc tattcttacc atagaaccca gggcagggag agaagaacct 
4381 agaggcctgg tttgctttgg tggcattgta aaaagagtaa gagaggtttg gtttgtggtg 
4441 gtttgctttc tttaccataa gcaatccctt gccttaactc atcacccttt ttcactatga 
4501 cccttagacc ctgagtattt tcaaatatat gattgctgat agtagtgacc aaaactactt 
4561 tgttcctttc ttaccactct ctcctggggc cgacacgttg ggacagcaca ccatagcata 
4621 aagctagggg atgcatggaa atagcagctt gaaactagga ggtaacaaga aagcttctag
4681 gaagtagatg ttccatatct tcaaaatgcc tcctccaatt ttgtaagaat gctagctagg 
4741 tattcctggg attattatac tgagatatat atatatacac acacacacac atatgtgtat
4801 atatgtatat atatatgtga gtatatatac acacacacac acacacacac atatatatat 
4861 atacacacac gcacacatat atgttgctgc agcataaaga aattgaaata aaagtttaaa 



// 
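
The BASE COUNT line of a GenBank record can be cross-checked against its ORIGIN block by counting letters and ignoring the position numbers. The following is a minimal sketch, not part of the original exhibit: the function name `count_bases` is hypothetical, and the sample input is only the first two ORIGIN lines of the AB077881 record, so its totals are not the full-record counts.

```python
from collections import Counter


def count_bases(origin_lines):
    """Count a/c/g/t in GenBank ORIGIN lines, skipping position numbers and spaces."""
    counts = Counter()
    for line in origin_lines:
        for ch in line.lower():
            if ch in "acgt":
                counts[ch] += 1
    return counts


# First two ORIGIN lines of the record (illustrative subset, 120 bases).
sample = [
    "1 atggattctt taccacggct gaccagcgtt ttgactttgc tgttctctgg cttgtggcat",
    "61 ttaggattaa cagcgacaaa ctacaactgt gatgatccac tagcatccct gctctctcca",
]

counts = count_bases(sample)
print("BASE COUNT", *(f"{counts[b]} {b}" for b in "acgt"))
```

Run over every ORIGIN line of a record, the four totals should add up to the sequence length given on the LOCUS line; a mismatch would point to a transcription or scanning error in the sequence.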
Caspr2, a new member of the neurexin superfamily, is localized 
at the juxtaparanodes of myelinated axons and associates with 
K+ channels. 

Poliak S, Gollan L, Martinez R, Custer A, Einheber S, Salzer JL, 
Trimmer JS, Shrager P, Peles E. 


Department of Molecular Cell Biology, The Weizmann Institute of Science, 
Rehovot, Israel. 

Rapid conduction in myelinated axons depends on the generation of 
specialized subcellular domains to which different sets of ion channels are 
localized. Here, we describe the identification of Caspr2, a mammalian 
homolog of Drosophila Neurexin IV (Nrx-IV), and show that this neurexin-like
protein and the closely related molecule Caspr/Paranodin demarcate
distinct subdomains in myelinated axons. While contactin-associated protein 
(Caspr) is present at the paranodal junctions, Caspr2 is precisely colocalized 
with Shaker-like K+ channels in the juxtaparanodal region. We further show 
that Caspr2 specifically associates with Kv1.1, Kv1.2, and their Kvbeta2
subunit. This association involves the C-terminal sequence of Caspr2, which 
contains a putative PDZ binding site. These results suggest a role for Caspr 
family members in the local differentiation of the axon into distinct 
functional subdomains. 

PMID: 10624965 [PubMed - indexed for MEDLINE] 
Caspr3 and caspr4, two novel members of the caspr family are 
expressed in the nervous system and interact with PDZ 
domains. 

Spiegel I, Salomon D, Erne B, Schaeren-Wiemers N, Peles E. 

Department of Molecular Cell Biology, The Weizmann Institute of Science,
Rehovot 76100, Israel. 

The NCP family of cell-recognition molecules represents a distinct subgroup 
of the neurexins that includes Caspr and Caspr2, as well as Drosophila 
Neurexin-IV and axotactin. Here, we report the identification of Caspr3 and 
Caspr4, two new NCPs expressed in nervous system. Caspr3 was detected 
along axons in the corpus callosum, spinal cord, basket cells in the 
cerebellum and in peripheral nerves, as well as in oligodendrocytes. In 
contrast, expression of Caspr4 was more restricted to specific neuronal 
subpopulations in the olfactory bulb, hippocampus, deep cerebellar nuclei, 
and the substantia nigra. Similar to the neurexins, the cytoplasmic tails of 
Caspr3 and Caspr4 interacted differentially with PDZ domain-containing 
proteins of the CASK/Lin2-Veli/Lin7-Mint1/Lin10 complex. The structural
organization and distinct cellular distribution of Caspr3 and Caspr4 suggest 
a potential role of these proteins in cell recognition within the nervous 
system. (c) 2002 Elsevier Science (USA).

PMID: 12093160 [PubMed - indexed for MEDLINE] 






http://www.ncbi.nlm.nih.gov/entrez/query .fcgi?cmd=Retrieve&db=PubMed&list_uids=120 12/16/2002 



Compare Genomic Sequences



FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAa9aWNs: 1331 aa
>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition molecule Caspr2 [Homo sapiens
vs /tmp/fastaDAAb9aWNs library
searching /tmp/fastaDAAb9aWNs library

1154 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
join: 40, opt: 28, gap-pen: -12/-2, width: 16
Scan time: 0.050
The best scores are:                                      opt
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 1595

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)
initn: 3285 init1: 1327 opt: 1595
Smith-Waterman score: 3418; 41.992% identity in 1305 aa overlap (12-1305:2-1153)






10 20 30 40 50 

669 MQAAPRAGCGAALLLWIVSSCL CRAWT--APSTSQKCDEPLVSGLPHVAFSSSSSI 



163 



MASVAWAVLKVLLLLPTQTWSPVGAGNPPDCDAPLASALPRSSFSSSSEL 
10 20 30 40 50 



60 70 80 90 100 110 

669 SGSYSPGYAKINKRGGAGGWSPSDSDHYQWLQVDFGNRKQISAIATQGRYSSSDWVTQYR 



163 SSSHGPGFSRLNRRDGAGGWTPLVSNKYQWLQIDLGERMEVTAVATQGGYGSSDWVTSYL 

60 70 80 90 100 110 

120 130 140 150 160 170 

669 MLYSDTGRJSnARCPYHQDGNIWAFPGNINSDGVVRHELQHPIIARYVRIVPLDWNGEGRIGL 

163 LMFSDGGRNWKQYRREESIWGFPGNTNADSVVHYRLQPPFEARFLRFLPLAWNPRGRIGM 

120 130 140 150 160 170 

180 190 200 210 220 230 

669 RIEVYGCSYWADVINFDGHWLPYRFRNKKMOT 

*....... . ... ... . . ■> ..*••■ 

163 RIEVYGCAYKSEWYFDGQSALLYRLDKKPLKPIRDVISLKFKAMQSNGILLHREGQHGN 

.180 . 190 200 - 210 220 230 

240 250 260 270 280 290 

669 YITLELKKAKLVIiSLNLGSNQLGPIYGHTSWTGSLLDDHHWHSVVIERQGRSINLTLDR 



* * * 



163 HITLELIKGKLVTFLNSGNAKLPSTIAPVT'LTLGSLLDDQHWHSvl.IELLDTQVNFTVDK 

240 250 260 270 280 290 

300 310 320 330 340 350 

669 SMQHFRTNGEFDYLDLDYEITFGGIPFSGKPSSSSRKNFKGCMESINYNGVNITDLARRK 



* # 



*•»*** 



163 HTHHFQAKGDS S YLDLNFEI S FGG I PTPGRS RAFRRKS FHGCLENLYYNGVDVTELAKKH 

300 310 320 330 340 350 



360 370 380 390 400 410 

gi|669 KLEPSNVGNLSFSCVEPYTVPV-FFNATSYLEVPGRLNQDLFSVSFQFRTWNPNGLLVFS
gi|163 KPQILMMGWSFSCPQPQTVPVTFLSSRSYLAL^^

360 370 380 390 400 410 






420 430 440 450 460 470 

669 HFAD^^llGNVEIDLTESKVGVHINITQTKMS 

: . . : . . : . - • 

163 ELRRGSGSFVLFLKDGKL--KLSLFQPGQSPRNVTAGAGIJ^GQWHSVSFSAKWSHM^^ 

420 .430 440 450 460 

480 490 500 510 520 530 

669 I DGDEAS AVRTNS PLQVKTGEKYFFGGFLNQMNNS SHSVLQPS FQGCMQL IQVDDQLVNL 



* * 



* • • 



* * * * 



163 VDDD — T AVQ PLVAVL I D S GDT YYF G 
470 480 490 



■DAAWTWQ- 
500 



540 550 560 570 580 590 

669 YEVAQRKPGSFANVSIDMCAIIDRCVPNHCEHGGKCSQTWDSFKCTCDETGYSGATCHNS 



163 



■HGGP- 



•DAVTLRGA- 
510 



600 610 620 630 640 650 

669 IYEPSCEAYKHLGQTSNYYWIDPDGSGPLGPLKVYCNMTEDKVWTIVSHDLQMQTPVVGY 



* 4 * 



163 



PSG- 



. 660 670 . 680 690 700 710 

669 NPEKYSVTQLVYSASMDQISAITDSAEYCEQWSYFCKMSRLLNTPDGSPYTWWVGKANE 



* • 



163 HPR--SAVSFAYAAGAGQLRSAVNLAERCEQRLALRCGTARRPDSRDGTPLSWWVGRTNE 
520 530 540 550 560 570 

720 730 740 750 760 770 

669 KHYYWGGSGPGIQKCACGIERNCTDPKYYCNCDADYKQWRKDAGFLSYKDHLPVSQVWG 



* ***** 



163 THTYWGGSLPDAQKCTCGLEGNCIDSQYYCNCDAGRNEWTSDTIVLSQKEHLPVTQIVMT 
580 590 600 610 620 630 

780 790 800 810 820 830 

669 DTDRQGSEAKLSVGPLRCQGDRNYWNAASFPNPSSYLHFSTFQGETSADISFYFKTLTPW 

... . . . ;.::...::.::: . .::::: . : . : : . : : . : . : : : 

163 DTGQPHSEADYTLGPLLCRGDQSFWNSASFNTETSYLHFPAFHGELTADVCFFFKTTVSS 



640 



650 



660 



670 



680 



690 



840 850 860 870 880 890 

669 GVFLENMGKEDF I KLELKS ATEVSF SFDVGNGPVE I WRS PTPLNDDQWHRVTAERNVKQ 



* * 



163 GOTMENLGITDFIRIELRAPTEVTFSFDVGNGPCEVTVQSPTPFNDNQWHHVRAERNVKG 
700 710 720 730 740 750 

900 910 920 930 940 950 

669 ASLQVDRL PQQ I RKAPTEGHTRLEL YSQLFVGG - AGGQQGFLGC I RSLRMNGVTLDLEER 



* * 



* * * * * * * 

* ******* 



****** 



163 ASLQVDQLPQKMQPAPADGHVRLQLNSQLFIGGTATRQRGFLGCIRSLQLNGVALDLEER 
760 770 780 790 800 810 



960 970 980 990 1000 1010 

gi|669 AKVTSGFISGCSGHCTSYGTNCENGGKCLERYHGYSCDCSNTAYDGTFCNKDVGAFFEEG
gi|163 ATVTPGVEPGCAGHCSTYGHLCRNGGRCREKRRGVTCDCAFSAYDGPFCSNEISAYFATG

820 830 840 850 860 870 






1020 1030 1040 1050 1060 1070 

669 MWLRYNFQAPATNARDS S SRVDNAPDQQNSHPD - - LAQEE I RF S FSTTKAPC I LL Y I S S F 



163 SSMTYHFQEHYTLSENSSSLVSSL HRDVTLTREMITLSFRTTRTPSLLLYVSSF 

880 890 900 910 920 

1080 1090 1100 1110 1120 1130 

669 TTDFLAVLVKPTGSLQIRYNLGGTREPYNIDVDHRNMANGQPHSWITRHEKTIFLKLDH 



163 YEEYLSVILANNGSLQIRYKLDRHQNPDAFTFDFKNMADGQLHQVKINREEAVVMVEVNQ 
930 940 950 960 970 980 

1140 1150 1160 1170 1180 1190 

669 YPSVSYHLPSSSDTLFNSPKSLFLGKVIETGKIDQEIHKYNTPGFTGCLSRVQFNQIAPL 

a . a . . . •■>•*■>• • • ••••••••• 

163 --STKKQVILSSGTEFNAVKSLILGKVLEAAGADPDTRRAATSGFTGCLSAVRFGRAAPL 
990 1000 1010 1020 1030 1040 

1200 1210 1220 1230 1240 

669 KAALRQTNAS AHVH I QGELVE - SNCGAS PLTLS PMS S ATDPWHLDHLDS AS ADFPYNPGQ 

■ ■>>* • • » _ * . ii . 

**■•■■■•»■•■•■•• ••••• • •••• 

163 KAALRPSGPS-RVTVRGHVAPMARCAAGAASGSPARELAPRLAGGAGRSGPAD E 

1050 1060 1070 1080 1090 

1250 1260 .1270 1280 1290 1300 

669 GQAIRNGVNRNSAIIGG^IAWIFTILCTLWLIRYMFRHKGTYHTNEAKGAESAESADA 



163 GEPLVNADRRDSAVIGGVIAWIFILLCITAIAIR-IYQQRKLRKENESKVSKKEEC 
1100 1110 1120 1130 1140 1150 



1310 1320 1330 

gi|669 AIMNNDPNFTETIDESKKEWLI 



1331 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:41:43 2002 done: Mon Dec 16 15:41:44 2002 
Scan time: 0.050 Display time: 2.167 

Function used was FASTA 
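
The FASTA summary line packs the score, percent identity, overlap length, and the aligned coordinate ranges into one string. The following is a minimal sketch, not part of the original exhibit, of pulling those numbers out with a regular expression: the names `SUMMARY` and `parse_summary` are hypothetical, and the pattern is an assumption fitted to the summary line as it is printed in this output, not a specification of the FASTA program's format.

```python
import re

# Pattern fitted to the summary line as printed in this exhibit (an assumption).
SUMMARY = re.compile(
    r"Smith-Waterman score:\s*(\d+);\s*([\d.]+)% identity in (\d+) aa overlap"
    r"\s*\((\d+)-(\d+):(\d+)-(\d+)\)"
)


def parse_summary(line):
    """Extract score, identity, overlap and coordinate ranges from a summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a FASTA summary line")
    score, ident, overlap, q_start, q_end, s_start, s_end = match.groups()
    return {
        "score": int(score),
        "identity_pct": float(ident),
        "overlap": int(overlap),
        "query_range": (int(q_start), int(q_end)),
        "library_range": (int(s_start), int(s_end)),
    }


line = ("Smith-Waterman score: 3418; 41.992% identity in 1305 aa overlap "
        "(12-1305:2-1153)")
result = parse_summary(line)
# Approximate number of identical columns implied by the percentage.
identical = round(result["identity_pct"] / 100 * result["overlap"])
print(result, identical)
```

Multiplying the percent identity by the overlap length recovers the approximate count of identical positions, which is the figure a reader would compare across the two FASTA runs.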
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000
Please cite:

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

/tmp/fastaOAA78aisO: 1331 aa
>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition molecule Caspr2 [Homo sapiens
vs /tmp/fastaPAA88aisO library
searching /tmp/fastaPAA88aisO library

1311 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
join: 40, opt: 28, gap-pen: -12/-2, width: 16
Scan time: 0.050
The best scores are:                                      opt
gi|18496979|ref|NP_207837.1| cell recognition pro (1311) 4248

>>gi|18496979|ref|NP_207837.1| cell recognition protein (1311 aa)
initn: 3152 init1: 1339 opt: 4248
Smith-Waterman score: 4406; 48.304% identity in 1327 aa overlap (13-1330:7-1310)






10 20 30 

669 MQAAPRAGCGAALLLWIVSSCLCRAWTAPSTS 



40 50 
QKCDEPLVSGLPHVAFSSSSSIS 



* * 



* * 



■ * * 



184 



MLLFYLLWLSIDSTKASALTNPNVALFLLADDCDDPLVSALPQASFSSSSELS 
10 20 30 40 50 



60 70 80 90 100 110 

669 GSYSPGYAKINKRGGAGGWSPSDSDHYQWLQVDFGNRKQISAIATQGRYSSSDWVTQYRM 



184 SSHGPGFARLNRRDGAGGWSPLVSNKYQWLQIDLGERMEVTAVATQGGYGSSNWVTSYLL 
60 70 80 90 100 110 

120 130 140 150 160 170 

669 LYSDTGRNWKPYHQDGNIWAFPGNINSDGVVRHELQHPIIARYVRIVPLDWNGEGRIGLR 



* ■ 



184 MFSDSGWNWKQYRQEDSIWGFSGNANADSVVYYRLQPSIKARFLRFIPLEWNPKGRIGMR 
120 130 140 150 160 170 

180 190 20.0 210 220 230 

669 IEWGCSYWADVINFDGHVVLPYRFRNKKMKTLKDVIALNFKTSESEGVILHGEGQQGDY 



• . * • 



184 IEVFGCAYRSEWDLDGKSSLLYRFDQKSLSPIKDIISLKFKTMQSDGILLHREGPNGDH 
180 190 200 210 220 230 

240 250 260 . 270 280 290 

669 ITLELKKAKLVLSLNLGSNQLGPIYGHTSVMTGSLLDDHHWHSWIERQGRSINLTLDRS 



* • 



184 ITLQLRRARLFLLINSGEAKLPSTSTLVNLTLGSLLDDQHWHSVLIQRLGKQVNFTVDEH 
240 250 260 270 280 290 

300 310 320 330 340 350 

669 MQHFRTNGEFDYLDLDYEITFGGIPFSGKPSSSSRKNFKGCMESINYNGVNITDLARRKK 



• ■■•>.*■■•■ 



• . . • . 



184 RHHFHARGEFNLMNLDYEISFGGIPAPGKSVSFPHRNFHGCLENLYYNGVTDIIDLAKQQK 
300 310 320 330 340 350 



360 370 380 390 400 410 

gi|669 LEPSNVGNLSFSCVEPYTVPV-FFNATSYLEVPGRLNQDLFSVSFQFRTWNPNGLLVFSH

gi|184 PQIIAMGNVSFSCSQPQSMPVTFLSSRSYLALPDFSGEEEVSATFQFRTWNKAGLLLFSE

360 370 380 390 400 410 






420 430 440 450 460 470 

gi|669 FADNLGNVEIDLTESKVGVHINITQTKMSQIDISSGSGLNDGQWHEVRFLAKENFAILTI

gi|184 LQLISGGILLFLSDGKL--KSNLYQPGKLPSDITAGVELNDGQWHSVSLSAKKNHLSVAV
420 430 440 450 460 470

480 490 500 510 520 530 

669 DGDEASAVRTNSPLQVKTGEKYFFGGFLNQMNNSSHSVLQPSFQGCMQLIQVDDQLVNLY 



* • # • 

■ • * * ■ 



184 DGQMASAAPLLGPEQIYSGGTYYFGGCPDKSFGSKCKSPLGGFQGCMRLISISGKWDLI 
480 490 500 510 520 530 

540 550 560 570 580 590

669 EVAQRKPGSFANVSIDMCAIIDRCVPNHCEHGGKCSQTWDSFKCTCDETGYSGATCHNSI 
# # . . . ... . . 

184 SVQQGSLGNFSDLQIDSCGISDRCLPNYCEHGGECSQSWSTFHCNCTNTGYRGATCHNSI 

540 550 560 570 580 590 

600 610 620 630 640 650 

gi|669 YEPSCEAYKHLGQTSNYYWIDPDGSGPLGPLKVYCNMTEDKVWTIVSHDLQMQTPVVGYN

gi|184 YEQSCEAYKHRGNTSGFYYIDSDGSGPLEPFLLYCNMTET-AWTIIQHNGSDLTRVRNTN

600 610 620 630 640 650

660 670 680 690 700 710 

gi|669 PEKYSVTQLVYSASMDQISAITDSAEYCEQYVSYFCKMSRLLNTPDGSPYTWWVGKANEK

gi|184 PENPYAGFFEYVASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTNET

660 670 680 690 700 710



720 730 740 750 760 770 

gi|669 HYYWGGSGPGIQKCACGIERNCTDPKYYCNCDADYKQWRKDAGFLSYKDHLPVSQVWGD

gi|184 QTYWGGSSPDLQKCTCGLEGNCIDSQYYCNCDADRNEWTNDTGLLAYKEHLPVTKIVITD

720 730 740 750 760 770 

780 790 800 810 820 830 

gi|669 TDRQGSEAKLSVGPLRCQGDRNYWNAASFPNPSSYLHFSTFQGETSADISFYFKTLTPWG

gi|184 TGRLHSEAAYKLGPLLCRGDRSFWNSASFDTEASYLHFPTFHGELSADVSFFFKTTASSG

780 790 800 810 820 830 

840 850 860 870 880 890 

gi|669 VFLENMGKEDFIKLELKSATEVSFSFDVGNGPVEIWRSPTPLNDDQWHRVTAERNVKQA

gi|184 VFLENLGIADFIRIELRSPTVVTFSFDVGNGPFEISVQSPTHFNDNQWHHVRVERNMKEA

840 850 860 870 880 890 

900 910 920 930 940 950 

gi|669 SLQVDRLPQQIRKAPTEGHTRLELYSQLFVGG-AGGQQGFLGCIRSLRMNGVTLDLEERA

gi|184 SLQVDQLTPKTQPAPADGHVLLQLNSQLFVGGTATRQRGFLGCIRSLQLNGMTLDLEERA

900 910 920 930 940 950 

960 970 980 990 1000 1010 

gi|669 KVTSGFISGCSGHCTSYGTNCENGGKCLERYHGYSCDCSNTAYDGTFCNKDVGAFFEEGM
gi|184 QVTPEVQPGCRGHCSSYGKLCRNGGKCRERPIGFFCDCTFSAYTGPFCSNEISAYFGSGS

960 970 980 990 1000 1010 






1020 1030 1040 1050 1060 1070 

669 WLRYNFQAPATNARDSSSRVDNAPDQQNSHPD--LAQEEIRFSFSTTKAPCILLYISSFT 



gi|184 SVIYNFQENYLLSKNSSSHA------ASFHGDMKLSREMIKFSFRTTRTPSLLLFVSSFY

1020 1030 1040 1050 1060

1080 1090 1100 1110 1120 1130 

669 TDFLAVLVKPTGSLQIRYNLGGTREPYNIDVDHRNMANGQPHSVNITRHEKTIFLKLDHY 



gi|184 KEYLSVIIAKNGSLQIRYKLNKYQEPDVVNFDFKNMADGQLHHIMINREEGWFIEIDDN
1070 1080 1090 1100 1110 1120 

1140 1150 1160 1170 1180 1190 

669 PSVSYHLPSSSDTLFNSPKSLFLGKVIETGKIDQEIHKYNTPGFTGCLSRVQFNQIAPLK 



» * ■ • 



184 RRRQVHL--SSGTEFSAVKSLVLGRILEHSDVDQETALAGAQGFTGCLSAVQLSHVAPLK 
1130 1140 1150 1160 1170 1180 

1200 1210 1220 1230 1240 1250 

669 AALRQTNASAHVHIQGELVESNCGASPLTLSPMSSATDPWHLDHLDSASADFPYNPGQGQ 

:::.... : . : : : . . : . : 
184 AALHPSHPDP-VTVTGHVTESSCMAQPGTDATSRERTHSFA-DH--SGTID-DREP 

1190 1200 1210 1220 1230 

1260 1270 1280 1290 1300 1310

669 AIRNGVNRNSAIIGGVIAWIFTILCTLVFLIRYMFRHKGTYHTNEAKGAESAESADAAI 

. : . . . .::.:::.::::::.:: . . . : . . . . : : . . : : : .:...::.: . 
184 -LANAIKSDSAVIGGLIAWIFILLCITAIAVR-IYQQKRLYKRSEAKRSENVDSAEA-V 
1240 1250 1260 1270 1280 1290 

1320 1330 
669 MNNDPNFTETIDESKKEWLI 



184 LKSELNIQNAVNENQKEYFF 

1300 1310 



1331 residues in 1 query sequences 
1311 residues in 1 library sequences 

Scomplib [version 3.3t05 March 30, 2000]

start: Mon Dec 16 15:45:32 2002 done: Mon Dec 16 15:45:34 2002 

Scan time: 0.050 Display time: 2.484 

Function used was FASTA 
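
The summary line of the run above ("Smith-Waterman score: 4406; 48.304% identity in 1327 aa overlap (13-1330:7-1310)") packs several values into one string. The following is a minimal Python sketch, not part of the original exhibit, of how such a line could be read programmatically. The function name `parse_summary`, the dictionary keys, and the derived `identical_columns` field are illustrative assumptions. The derived count assumes that the reported percentage is taken over the full overlap length.

```python
import re

# Matches the FASTA alignment summary line as it is printed in this exhibit, e.g.
# "Smith-Waterman score: 4406; 48.304% identity in 1327 aa overlap (13-1330:7-1310)"
SUMMARY_RE = re.compile(
    r"Smith-Waterman score:\s*(\d+);\s*([\d.]+)% identity in (\d+) aa overlap\s*"
    r"\((\d+)-(\d+):(\d+)-(\d+)\)"
)


def parse_summary(line):
    """Return the fields of one summary line as a dict (hypothetical helper)."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a FASTA summary line: %r" % line)
    score, pct, overlap, q_start, q_end, l_start, l_end = match.groups()
    overlap = int(overlap)
    pct = float(pct)
    return {
        "sw_score": int(score),
        "percent_identity": pct,
        "overlap": overlap,
        "query_range": (int(q_start), int(q_end)),
        "library_range": (int(l_start), int(l_end)),
        # Assumption: identity is reported relative to the overlap length,
        # so the number of identical columns is recovered by rounding.
        "identical_columns": round(pct / 100 * overlap),
    }


if __name__ == "__main__":
    line = ("Smith-Waterman score: 4406; 48.304% identity "
            "in 1327 aa overlap (13-1330:7-1310)")
    print(parse_summary(line))
```

The same helper applies unchanged to the summary lines of the other runs in this exhibit, since they share the same layout.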



FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 

Please cite:

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

/tmp/fastaKAA38aisO: 1311 aa 

>gi|18496979|ref|NP_207837.1| cell recognition protein CASPR4 isoform 1; contactin

vs /tmp/fastaLAA48aisO library
searching /tmp/fastaLAA48aisO library

1154 residues in 1 sequences



FASTA (3.34 January 2000) function [optimized, BL50 matrix (15: -5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.050 
The best scores are: opt 
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 3080

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)

initn: 5264 initl: 2744 opt: 3080
Smith-Waterman score: 5135; 62.580% identity in 1256 aa overlap (33-1283:30-1150) 






10 20 30 40 50 60 

184 LFYLLWLSIDSTKASALTNPNVALFLLADDCDDPLVSALPQASFSSSSELSSSHGPGFA 

* 

» . . .... *.***. **#**«■**. 

... .».««»••«•«••«•••••••••••• 

163 MASVAWAVLKVLLLLPTQTWSPVGAGNPPDCDAPLASALPRSSFSSSSELSSSHGPGFS 

10 20 30 .40 50 

70 80 90 100 . 110 120 

184 RLNRRDGAGGWSPLVSNKYQWLQIDLGERME 

:::::::::::.::::::::::::::::::::::::::::::::.:: : : : : : : : : : . : : 
163 RLNRRDGAGGWTPLVSNKYQWLQIDLGERMEVTAVATQGGYGSSDWVTSYLLMFSDGGRN 
60 70 80 90 100 110 

130 140 150 160 170 180 

184 WKQYRQEDSIWGFSGNANADSWYYRLQPSIKARFLRFIPLEWNPKGRIGMRIEVFGCAY 

:::::.:.::::: ::.::::::.::::: ..::::::.:: 
163 WKQYRREESIWGFPGNTNADSVVHYRLQPPFEARFLRFLPLAWNPRGRIGMRIEVYGCAY 

120 130 140 150 160 170 

190 200 210 220 230 240 

184 RSEWDLDGKSSLLYRFDQKSLSPIKDIISLKFKTMQSDGILLHREGPNGDHITLQLRRA 



* * 



» • • * * 



163 KSEWYFDGQSALLYRLDKKPLKPIRDVISLKFKAMQSNGILLHREGQHGNHITLELIKG 
180 190 200 210 220 230 

250 260 270 280 290 300 

gi|184 RLFLLINSGEAKLPSTSTLVNLTLGSLLDDQHWHSVLIQRLGKQVNFTVDEHRHHFHARG

gi|163 KLVFFLNSGNAKLPSTIAPVTLTLGSLLDDQHWHSVLIELLDTQVNFTVDKHTHHFQAKG
240 250 260 270 280 290

310 320 330 340 350 360 

184 EFNLMNLDYEISFGGIPAPGKSVSFPHRNFHGCLENLYYNGVDIIDLAKQQKPQIIAMGN 

. .•■■■•*• •■ • • ::::. ::: 

163 DSSYLDLNFEISFGGIPTPGRSRAFRRKSFHGCLENLYYNGWVTELAKKHKPQILMMGN 
300 310 320 330 340 350 



370 380 390 400 410 420 

gi|184 VSFSCSQPQSMPVTFLSSRSYLALPDFSGEEEVSATFQFRTWNKAGLLLFSELQLISGGI
gi|163 VSFSCPQPQTVPVTFLSSRSYLALPGNSGEDKVSVTFQFRTWNRAGHLLFGELRRGSGSF
360 370 380 390 400 410 






430 440 450 460 470 480 

gi|184 LLFLSDGKLKSNLYQPGKLPSDITAGVELNDGQWHSVSLSAKKNHLSVAVDGQMASAAPL

163 VLFLKDGKLKLSLFQPGQSPRNWAGAGLNDGQTO 

420 430, 440 450 460 470 

490 500 510 520 530 540 

184 LGPEQIY SGGTYYFGGC PDKSFGSKCKSPLGGFQGCMRLI S I SGKWDLI SVQQGSLGNF 



163 VAV-LIDSGDTYYFG 
480 490 



550 560 570 580 590 600 

184 SDLQIDSCGISDRCLPNYCEHGGECSQSWSTFHCNCTNTGYRGATCHNSIYEQSCEAYKH 



163 



163 



610 620 630 640 650 660 

184 RGNTSGFYYIDSDGSGPLEPFLLYCNMTETAWTI IQHNGSDLTRVRNTNPENPYAGF -FE 



DAAWTWQHGGPDAVTLRGAPSGHPRSAVSFA 
500 510 520 



670 680 690 700 710 720 

184 YVASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTNETQTYWGGSSPD 



163 YAAGAGQLRSAVNLAERCEQRLALRCGTARRPDSRDGTPLSWWVGRTNETHTYWGGSLPD 
530 540 550 560 570 580 

730 740 750 760 770 780 

184 LQKCTCGLEGNCIDSQYYCNCDADRNEWTNDTGLLAYKEHLPVTKIVITDTGRLHSEAAY 



■•■ ••••••• 



163 AQKCTCGLEGNCIDSQYYCNCDAGRNEWTSDTIVLSQKEHLPVTQIVMTDTGQPHSEADY 
590 600 610 620 630 640 

790 800 810 820 830 840 

184 KLGPLLCRGDRSFWNSASFDTEASYLHFPTFHGELSADVSFFFKTTASSGVFLENLGIAD 



• • • * * 



163 TLGPLLCRGDQSFWNSASFNTETSYLHFPAFHGELTADVCFFFKTTVSSGVFMENLGITD 
650 660. 670 680 690 700 

850 860 870 880 890 900 

184 FIRIELRSPTVVTFSFDVGNGPFEISVQSPTHFNDNQWHHVRVERNMKEASLQVDQLTPK 



* • ■ » * * 



* * 

* * 



* «***••** 



163 FIRIELRAPTEVTFSFDVGNGPCEVTVQSPTPFNDNQWHHVRAERNVKGASLQVDQLPQK 
710 720 730 740 750 760 

910 ^ 920 930 940 950 960 

184 TQPAPADGHVLLQLNSQLFVGGTATRQRGFLGCIRSLQLNGMTLDLEERAQVTPEVQPGC 



« * • * * 



163 MQPAPADGHVRLQLNSQLFIGGTATRQRGFLGCIRSLQLNGVALDLEERATVTPGVEPGC 

790 800 810 820 



770 



780 



970 980 990 1000 1010 1020 

gi|184 RGHCSSYGKLCRNGGKCRERPIGFFCDCTFSAYTGPFCSNEISAYFGSGSSVIYNFQENY
gi|163 AGHCSTYGHLCRNGGRCREKRRGVTCDCAFSAYDGPFCSNEISAYFATGSSMTYHFQEHY

830 840 850 860 870 880 



1030 1040 1050 1060 1070 1080 

184 LLSKNSSSHAASFHGDMKLSREMIKFSFRTTRTPSLLLFVSSFYKEYLSVIIAKNGSLQI 



163 TLSENSSSLVSSLHRDWLTREMITLSFRTTRTPSLLLYVSSFYEEYLSVILANNGSLQI 
.890 900 910 920 930 940 

1090 1100 1110 1120 1130 1140 

184 RYKLNKYQEPDWNFDFKNMADGQLHHIMINREEGWFIEIDDNRRRQVHLSSGTEFSAV 



163 RYKLDRHQNPDAFTFDFKNMADGQLHQVKINREEAV^ 

950 960 970 980 990 1000 

1150 1160 1170 1180 1190 1200 

184 KSLVLGRILEHSDVDQETALAGAQGFTGCLSAVQLSHVAPLKAALHPSHPDPVTVTGHVT 



163 KSLILGKVLEAAGADPDTRRAATSGFTGCLSAVRFGRAAPLKAALRPSGPSRVTVRGHVA 
1010 1020 1030 1040 1050 1060 

1210 1220 1230 1240 1250 

184 E-SSCMAQPGTDATSRE RTHSFADHSGTIDDREPLANAIKSDSAVIGGLIAWIFIL 



* * 



* * * # 



* * 



163 PMARCAAGAASGSPARELAPRLAGGAGRSGPADEGEPLVNADRRDSAVIGGVIAWIFIL 
1070 1080 1090 1100 1110 1120 

1260 1270 1280 1290 1300 1310 

184 LCITAIAVRIYQQKRLYKRSEAKRSENVDSAEAVLKSELNIQNAVNENQKEYFF 



163 LCITAIAIRIYQQRKLRKENESKVSKKEEC 
1130 1140 1150 



1311 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:31:46 2002 done: Mon Dec 16 15:31:47 2002 
Scan time: 0.050 Display time: 2.066 

Function used was FASTA 
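
The "% identity" figure reported for each alignment can be illustrated with a short calculation. The following minimal Python sketch is not part of the original exhibit. The function name `percent_identity` is hypothetical, and the sketch assumes that identity is counted as matching residues divided by the number of aligned columns, gaps included. The two input strings are the first 31 aligned residues of the row beginning "RLNRRDGAGGW" in the alignment above (gi|184 against gi|163).

```python
def percent_identity(row_a, row_b, gap="-"):
    """Count identical columns in two equal-length aligned rows.

    Returns (identical, columns, percent). A column counts as identical
    only when both rows carry the same residue and it is not a gap.
    Assumption: the denominator is the full number of aligned columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("aligned rows must not be empty")
    identical = sum(
        1 for a, b in zip(row_a, row_b) if a == b and a != gap
    )
    columns = len(row_a)
    return identical, columns, 100.0 * identical / columns


if __name__ == "__main__":
    # First 31 aligned residues of one row pair from the run above.
    row_184 = "RLNRRDGAGGWSPLVSNKYQWLQIDLGERME"
    row_163 = "RLNRRDGAGGWTPLVSNKYQWLQIDLGERME"
    identical, columns, pct = percent_identity(row_184, row_163)
    print(identical, columns, round(pct, 3))
```

Applied to a complete alignment, the same count over all aligned columns would correspond to the overlap length given in the summary line, under the stated assumption about the denominator.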



FASTA searches a protein or DNA sequence data bank 

version 3.3t05 March 30, 2000 
Please cite:

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAAG7ai6z: 1477 aa
>gi|14149613|ref|NP_004792.1| neurexin 1 isoform alpha precursor; neurexin I; simil

vs /tmp/fastaHAAH7ai6z library
searching /tmp/fastaHAAH7ai6z library

1331 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16

Scan time: 0.033
The best scores are: opt
gi|6694278|gb|AAF25199.1|AF193613_1 cell recognit (1331) 376

>>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition m (1331 aa)

initn: 397 initl: 163 opt: 376 
Smith-Waterman score: 530; 24.015% identity in 1295 aa overlap (259-1469:165-1327) 






230 240 250 260 270 280 

gi|141 LNGGVCSVVDDQAVCDCSRTGFRGKDCSQEDNNVEGLAHLMMGDQGKSKGKEEYIATFKG

gi|669 AFPGNINSDGVVRHELQHPIIARYVRIVPLDWNGEGRIGLRIEVYGCSYWAD--VINFDG
140 150 160 170 180 190

290 300 310 320 330 340

gi|141 SEYFCYDLSQNPIQSSSDEITLSFKTLQRNGLMLH-TGKSADYVNLALKNGAVSLVINLG

gi|669 HVVLPYRFRNKKMKTLKDVIALNFKTSESEGVILHGEGQQGDYITLELKKAKLVLSLNLG
200 210 220 230 240 250



350 360 370 380 390 400 

141 S G AF EALVE P VNGKF -NDNAWHDVKVTRNLRQH SGI GHAMVT I S VDG I LTTTG YTQE 



669 SNQLGPIYGHTSVMTGSLLDDHHWHSWIERQGRS 
260 270 280 



I NLTLDRSMQH F - RTNG 
290 300 



410 420 430 440 450 460 

141 DYTMLGSDDFFWGGSPSTADLPGSPVSNNFMGCLKEVVYKNNDVIU.ELSRLAKQGDPKM 



669 EFDYLDLDYEITFGGIPFSGK-PSSSSRKNFKGCMESINYNGVNIT-DLAR-RKKLEPSN 
310 320 330 340 350 360 

470 480 490 500 510 520 

141 KIHGWAFKCENVATLDPITFETPESFISLPKWNAKKTGSISFDFRTTEPNGLILFSH-- 



* • 



669 V--GNLSFSCVEPYTV-PVFFNAT-SYLEVPGRLNQDLFSVSFQFRTWNPNGLLVFSHFA 

370 380 390 400 410 

530 540 550 560 570 

141 -GKPRHQKDAKHPQM-IKVDFFAIEMLDGHLYLLLDMGSGTIKIKALLKKVMDGEWY^ 



***** 



669 DNLGNVEIDLTESKVGVHINITQTKMSQ 
420 430 440 



IDISSGS 
450 



■ GLNDGQWH EVR 
460 



630 



580 590 600 610 620 

gi|141 FQRDGRSGTISWTLRTPYTAPGESEILDLDDELYLGGLPENKAGLVFPTEVWTALLNYG
gi|669 FLAKENFAILTIDGDEASAVRTNSPLQVKTGEKYFFGGFLNQMNNSSH

470 480 490 500 510 



SVLQPS 






640 650 660 670 680 690 

141 YVGC IRDLFIDGQSKDI RQMAEVQ- - STAGVKPS - C SKETAKPCLSNPCKNNGMCRDGWN 



669 FQGCMQLIQVDDQLVNLYEVAQRKPGSFANVSIDMCA--IIDRCVPNHCEHGGKCSQTWD 
.520 530 540 . 550 560 570 



700 710 720 
141 R YVC DC S GTG YLGRS C E REATVLSY 



DGS 



730 
■MFMKIQLP 



* * 



669 SFKCTCDETGYSGATCHNSIYEPSCEAYKHLGQTSNYYWIDPDGSGPLGPLKVYCNMTED 
580 590 600 610 620 630 

740 750 760 770 780 

141 WMHTEAEDVSLRF RSQRAYG I - -LMATTSRD- - SADTLRLELDAGRVKLTVNLDC 



669 KVWTIVSHDLQMQTPWGYNPEKYSVTQLVYSASMDQI SAITDSAEYCEQYVS YFCKMS - 
640 650 660 670 680 690 

790 800 810 820 830 840 

141 IRINCNSSKGPETLFAGYT^NDNEWHTVRVVRRGKSL 

669 -RLLNTPDGSPYTWWVG KANEKH-YYWGGSGPGI QKCACGIERNCTDPKYY 

700 710 720 730 740 

850 860 870 880 890 900 

141 HNIETGIITERRYLSSVPSNFIGHLQSLTFNGMAYIDLCKNGD---IDYGELNARFGFRN 

:.. ... : • • • • • 

669 CNC D AD YKQWRK DAGFL S YKDHL PVS QVWGDTDRQGS EAKL SVG PLRC Q - GDRN 

750 760 770 780 790 

910 920 930 940 950 960 

141 IIADPVTFKTKSSYVALATLQAYTSMHLFFQFKTTSLDGLILYNSGDGNDFIWELVKGY 



* • 



■ • 



* * * • 



669 YW-NAASFPNPSSYLHFSTFQGETSADISFYFKTLTPWGVFLENMGK-EDFIKLELKSAT 
800 810 820 830 840 850 

970 980 990 1000 1010 

141 -LHYWDLGNGANLIKGSSNKPLNDNQWHNVMISROT 



• * 



* * * * # 



■ ** ***** 



* * 



669 EVSFSFDVGNGPVEIWRSPTPLNDDQWHRVTAERNVKQASLQVDRLPQQIRKAPTEGHT 
860 870 880 .. 890 900 910 

1020 1030 1040 1050 1060 1070 

141 NLDLKSDLYIGGVAKETYKSLPKLVHAKEGFQGCLASVDLNGRLPDLISDALFCNGQIER 



* • 



* • * * 



* * 



669 RLELYSQLFVGGAG GQQGFLGC IRSLRMNGVTLDLEERAKVTSGFI S - 

920 930 940 950 960 

1080 1090 1100 1110 1120 1130 

141 GCEGPSTTCQEDSCSNQGVCLQQWDGFSCDCSMTSFSGPLCN-DPGTTYIFSKGGG-QIT 



* • 



* * 



669 GCSGHCTSYGTN-CENGGKCLERYHGYSCDCSNTAYDGTFCNKDVGA--FFEEGMWLRYN 

970 980 990 1000 1010 



1140 

gi|141 YKWP- 



1150 1160 1170 

■NDRPSTRADRLAIGFSTVQKEAVLVRVDSSSGLGDYLE 
gi|669 FQAPATNARDSSSRVDNAPDQQNSHPDLAQEEIRFSFSTTKAPCILLYI--SSFTTDFLA
1020 1030 1040 1050 1060 1070 



1180 



1190 



1200 



1210 



1220 



1230 






141 LHIHQ-GKIGVKFNVG--TDDIAIEESNAIINDGKYHWRFTRSGGNATLQVDSWPVIER 



669 VLVKPTGSLQIRYNLGGTREPYNIDVDHRNMANGQPHSVNITRHEKTIFLKLDHYPSVSY 
1080 1090 1100 1110 1120 1130 

1240 1250 1260 1270 1280 

141 Y P AGRQLT I FNS Q AT 1 1 1 G GK EQGQP-FQGQLSGLYYNGLKVLNMAAEN 



669 HLPSSSDTLFNSPKSLFLGKVIETGKIDQEIHKYNTPGFTGCLSRVQFNQIAPLK-AALR 
1140 1150 1160 1170 1180 1190 

1290 1300 1310 1320 1330 1340 

141 DANIAIVG^^VRLVGEVPSSMTTESTATAMQSEMSTSIMETTTTLATSTARRGKPPTKEPI 



« * 



669 QTNAS--AHVHIQGEL VESNCGA- 

1200 1210 



SPLTLSPM 
1220 



1350 1360 1370 1380 1390 1400 

141 SQTTDDILVASAECPSDDEDIDPCEPSSGGLANPTRAGGREPYPGSAEVIRESSSTTGMV 



669 S S ATDPWHLDHLDS AS ADF P YNP 
1230 1240 



-GQGQAIRNG 
1250 



-VNRNSAI IGGV 
1260 



1410 1420 1430 1440 1450 1460 

141 VGIVAAAALCILILLYAMYKYRNRDEGSYHVDESRNYISNSAQSNGAWKEKQPSSAKSS 



* • 



669 IAWIFTILCTLVFLI RYMFRHKGTYHTNEAKG- -AESAESADAAIMNNDPNFTETI 

1270 1280 1290 1300 1310 1320 

1470 

141 NKNKKNKDKEYYV 



669 DESKKEWLI 
1330 



1477 residues in 1 query sequences 
1331 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000]

start: Mon Dec 16 16:01:21 2002 done: Mon Dec 16 16:01:22 2002
Scan time: 0.033 Display time: 2.417 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000

Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAAyHaq6z: 1477 aa

>gi|14149613|ref|NP_004792.1| neurexin 1 isoform alpha precursor; neurexin I; simil

vs /tmp/fastaHAAzHaq6z library
searching /tmp/fastaHAAzHaq6z library

1154 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16

Scan time: 0.050
The best scores are: opt
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 281

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)

initn: 204 initl: 152 opt: 281 
Smith-Waterman score: 641; 22.746% identity in 1187 aa overlap (283-1437:183-1150) 






260 270 280 290 300 310 

141 KDCSQEDNNVEGLAHLMMGDQGKSKGKEEYIATFKGSEYFCYDLSQNPIQSSSDEITLSF 



163 RFLRFLPLAWNPRGRIGMRIEVYGCAYKSEWYFDGQSALLYRLDKKPLKPIRDVISLKF 

160 170 180 190 200 210 

320 330 340 350 360 

141 KTLQRNGLMLHT-GKSADYVm^KNGAVSLVINLGSGAFEALVEPVN- - -GKF-NDNAW 

::..:: : : : . : . . . : . • : : . : . . . : . : 

163 KAMQSNGILLHREGQHGNHITLELIKGKLVFFLNSGNAKLPSTIAPVTLTLGSLLDDQHW 

220 230 240 250 260 270 

370 380 390 400 410 420 

141 HDVKVTRNLRQHSGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYVGGSPSTADLPG 



163 HSVLIE 



- LLDTQVNFTVDK- HTHHFQAKGDS S YLDLNFE I S FGGI PT PG 

280 290 300 310 



430 440 450 460 470 480 

141 SPVS NNFMGCLKEVVYKN1TOV11LELSRLAKQGDPKMKIHGWAPKC 



.... 



163 RSRAFRRKSFHGCLENLYYNGVDV TELAKKHKPQILMMGNVSFSCPQPQTV-PVTF 

320 330 340 350 360 370 

490 500 510 520 530 540 

141 ETPESFISLPKWNAKKTGSISFDFRTTEPNGLILFSHGKPRHQKDAKHPQMIKVDFFAIE 



163 LSSRSYLALPGNSGEDKVSVTFQFRTWNRAGHLLF--GELRRGSGS 
380 390 400 410 



-FVLF 
420 



550 560 570 580 590 600 

141 MLDGHLYL-LLDMGSGTIKIKALLKKVOTGEV 



■ 4 



163 LKDGKLKLSLFQPGQSPRNVTAG-AGLI^GQWHSVSF 

430 440 450 460 470 



610 620 630 640 650 660 

gi|141 SEILDLDDELYLGGLPENKAGLVFPTEVWTALLNYGYVGCIRDLFIDGQSKDIRQMAEVQ
gi|163 AVLIDSGDTYYFGD
480 490 



■ AAWTWQHGGPDAVTLRGAP SGH PRS AVS FAYAA 
500 510 520 






670 680 690 700 710 720 

141 STAGVKPSCSKETAKPCLSNPCKNNGMCRDGWNRYVCDCSGTG YLGRSCEREATVLS 



* ■ 



163 GAGQLRSAVN- 
530 



IjAERCEQRIiALRCGTARRPDSRDGTPLSWWVGRTNETH T 

540 550 560 570 



730 740 750 760 770 780 

141 YDGSMFMKIQLPVVMHTEAEDVSLRFRSQRAYGILMATTSRDSADTLRLELDAGRVKLTV 



163 YWGG 
580 



SLP- 



DAQKCTCGL 
590 



790 800 810 820 830 

141 NLDCI- -RINCNSSKGPETLFAGYl^NDNEWHTVRVVRRGKSLKLTVDDQQAMT- -GQMA 



163 EGNC IDS Q YYC NC DAG 

600 



-RNEWTSDTIVLSQKE-HLPVT-QIVMTDTGQ 
610 620 630 



840 850 860 870 880 890 

141 GDHTRLEFHNIETGIITERRYLSSVPSNFIGHLQSLTFNGMAYIDLCKNGDIDYCELNAR 



> • • • 



163 -PHSEADYT- 
640 



LGPL- 



— LCR-GDQSF 
650 



900 910 920 930 940 950 

141 FGFRNIIADPVTFKTKSSYVALATLQAYTSMHLFFQFKTTSLDGLILYNSGDGNDFIWE 



* *■ 
* * * 



• * * * 



■ • 



163 



■WNSASFNTETSYLHFPAFHGELTADVCFFFKTTVSSGVFMENLGI-TDFIRIE 
660 670 680 690 700 



960 970 980 990 1000 1010 

141 l - VKG YLHYVF DLGNGANL I KGS SNKPLNDNQWHNVM I S RDT - - SNLHTVKI DTKI TTQ I 



* • 



163 LRAPTE^TFSFDVGNGPCEVTVQSPTPFNDNQWHHVRAERNVKGASLQVDQLPQKMQPAP 
710 720 730 740 750 760 

1020 1030 1040 1050 1060 1070 

141 TAGARNLDLKSDLYIGGVAKETYKSLPKLVHAKEGFQGCLASVDLNGRLPDLISDALFCN 



« * * * 



* * 



* * 

* * 



163 ADGHVRLQLNS QLF I GGTATR ■ 
770 780 790 



■QRGFLGCIRSLQLNGVALDLEERATVTP 
800 . 810 



1080 1090 1100 1110 1120 1130 

141 GQIERGCEGPSTTCQEDSCSNQGVCLQQWDGFSCDCSMTSFSGPLCNDPGTTYIFSKGGG 



• • • • 



• • • > • • 



* a « * * a 



163 G-VEPGCAGHCSTYGH-LCRNGGRCREKRRGVTCDCAFSAYDGPFCSNEISAYFAT--GS 
820 830 840 850 860 870 

1140 1150 1160 1170 1180 1190 

141 QITYKWPPNDRPSTRADRLAIGFSTVQKEAVLVRVDSSSGLGDYLELHIHQGKI-GVKFN 



163 SMTYHFQEHYTLSENSSSLV SSLHRDVTLTR- 

880 890 900 



EMITLSFRTTRTPSLLLY 
910 920 



1200 1210 1220 1230 1240 

gi 1 141 VGTDDIAIEESNAII - -NDGKYHV-VRFTRSGGNATLQVDSWPVIERYPAGR- -QLTIFN 



http://bioinformatics.lexgen.com/tools/fasta3.php3 



12/16/2002 






* * * 



gi 1 163 VSS FYEEYLSVILANNGSLQIRYKLDRHQNPDAFTFD FKNMADGQLHQVKINR 

930 940 950 960 970 






1250 1260 1270 1280 1290 1300 

141 SQATI I IGGKEQGQPFQGQL SGLYYNGLKVLNMAAENDANIAIVGNV-RLVGEVPSS 



163 EEAWMV EVNQSTKKQVILSSGTEFNAVKSL 

.980 990 1000 



-ILGKVLEAAGADPD- 
1010 1020 



1310 1320 1330 1340 1350 1360 

141 MTTESTATAMQSEMSTSIMETTTTLATSTARRGKPPTKEPISQTTDDI — LVASAECPSD 

. • .... * ■ : 

• * • • • . ....... • • * • • • • • 

163 -TRRAATSGFTGCLSAVRFGRAAPL--KAALRPSGPSRVTVRGHVAPMARCAAGAASGSP 

1030 1040 1050 1060 1070 

1370 1380 1390 1400 1410 1420 

141 DEDIDPCEPSSGGLANPTRAGGREPYPGSAEVIRESSSTTGMWGIVAAAALCILILLYA 

* * ••• • • • • • • • • 

... . . ... 

163 ARELAPRLAGGAGRSGPADEG--EPLVNAD RRDSAVIGGVIAWIFILLCITAIAIR 

1080 1090 1100 1110 1120 1130 

1430 1440 1450 1460 1470 

141 MYKYRN-RDEGSYHVDESRNYISNSAQSNGAVVKEKQPSSAKSSNKNKKNKDKEYYV 



* * 



163 IYQQRKLRKENESKVSKKEEC 
1140 1150 



1477 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 16:02:13 2002 done: Mon Dec 16 16:02:14 2002 
Scan time: 0.050 Display time: 1.967 



Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaKAACHaq6z: 1477 aa

>gi|14149613|ref|NP_004792.1| neurexin 1 isoform alpha precursor; neurexin I; simil

vs /tmp/fastaLAADHaq6z library
searching /tmp/fastaLAADHaq6z library

1311 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16

Scan time: 0.017
The best scores are: opt
gi|18496979|ref|NP_207837.1| cell recognition pro (1311) 403

>>gi|18496979|ref|NP_207837.1| cell recognition protein (1311 aa)

initn: 265 initl: 150 opt: 403 
Smith-Waterman score: 866; 25.392% identity in 1020 aa overlap (283-1229:186-1123) 






260 270 280 290 300 310 

141 KDCSQEDNNVEGLAHLMMGDQGKSKGKEEYIATFKGSEYFCYDLSQNPIQSSSDEITLSF 



184 RFLRFIPLEWNPKGRIGMRIEVFGCAYRSEWDLDGKSSLLYRFDQKSLSPIKDIISLKF 
160 170 180 190 200 210 

320 330 340 350 360 

141 KTLQRNGLMLHT-GKSADYVNLALK^ 



184 KTMQSDGILLHREGPNGDHITLQLRRARLFLLINSGEAKLPSTSTLVNLTLGSLLDDQHW 
220 230 240 250 260 270 

370 380 390 400 410 420 

141 HDVKVTRNLRQHSGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYVGGSPSTADLPG 



184 HSVLIQRLGKQ- 
280 



•WFTVDE-HRHHFHARGEFNLMNLDYEISFGGIPA PG 

290 300 310 320 



430 440 450 460 470 480 

141 S PVS — - NNFMGC LKEVVYKl^TDVRLEL S RLAKQGD PKMK I HGWAFKC ENVATLDP I TF 



* * 



• ■ * * 



• * 

* * * * 



* • * * » 



184 KSVSFPHRNFHGCLENLYYNGVDI IDLAKQQKPQIIAMGNVSFSCSQPQSM-PVTF 

.330 340. 350 360 370 

490 500 510 520 530 540 

141 ETPESFISLPKWNAKKTGSISFDFRTTEPNGLILFSHGKPRHQKDAKHPQMIKVDFFAIE 



* * 

* * 



184 LSSRSYLALPDFSGEEEVSATFQFRTWNKAGLLLFSEL- 
380 390 400 410 



QLISGGIL-LF 
420 



550 560 570 580 590 600 

141 MLDGHLYL-LLDMGSGTIKIKALLKKVTO 



* * 



184 LSDGKLKSNLYQPGKLPSDITAGV^-LNDGQWHSVSLSAKKNHLSVAVDG-QMASAAPLL 
430 440 450 460 470 480 



610 620 630 640 650 

gi|141 GESEILDLDDELYLGGLPENKAGLVFPTEVWTALLNYGYVGCIRDLFIDGQSKDI--RQM
gi|184 GPEQIYS-GGTYYFGGCPDKS FGSKCKSPL--GGFQGCMRLISISGKWDLISVQQ

490 500 510 520 530 






660 670 680 690 700 710 

141 AEVQSTAGVK-PSCSKETAKPCLSNPCKNNGMCRDGWNRYVCDCSGTGYLGRSC 



184 GSLGNFSDLQIDSCG--ISDRCLPNYCEHGGECSQSWSTFHCNCTNTGYRGATCHNSIYE 
-540 550 560 .570 580 590 

720 730 740 750 760 770 

141 REATVLSYDG--SMFMKIQLPVVMHTEAEDVSLRFRSQRAYGILMATTSRDSADTLRLEL 



* * 



• * * * 



184 QSCEAYKHRGNTSGFYYIDSDGSGPLEPFLLYCNM-TETAWTIIQ HNGSDLTRVRN 

600 610 620 630 640 



141 D 



780 790 800 

-AGRVKLTVNLDCIRINCNSSKGPETLFAGY- 



810 

•NLNDN EWHTVRW 



184 TNPENPYAGFFEWASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTN 
650 660 670 680 690 700 



820 830 
141 RRGKSLKLTVDDQQAMTGQMAG- 



840 850 860 

DHTRLEFHNIETGIITERRYL — SSVP 



* * 



184 ETQTYWGGSSPDLQKCTCGLEGNCIDSQYYCNCDADRNEWTN-DTGLLAYKEHLPVTKIV 
710 .720 730 740 750 760 

870 880 890 900 910 . 920. 

141 SNFIGHLQSLTFNGMAYIDLCKNGDIDYCELNARFGFRNIIADPVTFKTKSSYVALATLQ 



184 ITDTGRLHS EAAYKLG PL - LCR- GDRS F 
770 780 790 



•WNSASFDTEASYLHFPTFH 
800 810 



930 940 950 . 960 970 980 

141 AYTSMHLFFQFKTTSLDGLILYNSGDGNDFIWELVKG-YLHYVFDLGNGANIilKGSSNK 



184 GELSADVSFFFKTTASSGVFLENLGIA-DFIRIELRSPTWTFSFDVGNGPFEISVQSPT 
820 830 840 850 860 870 

990 1000 1010 1020 1030 

141 PLNDNQWHNVMI SRD — TSNLHTVKIDTKITTQITAGARNLDLKSDLYIGGVAKETYKSL 



* * * 



* * * 



184 HFNDNQWHHVRVERNMKEASLQVDQLTPKTQPAPADGHVLLQLNSQLFVGGTATR- 
880 890 900 910 . 920 



1040 1050 1060 1070 1080 1090 

141 PKLVHAKEGFQGCLASVDLNGRLPDLISDALFCNGQIERGCEGPSTTCQEDSCSNQGVCL 



184 QRGFLGCIRSLQLNGMTLDLEERAQV-TPEVQPGCRGHCSSYGK-LCRNGGKCR 

930 940 950 960 970 



1100 1110 1120 1130 1140 
141 QQWDGFSCDCSMTSFSGPLCNDPGTTYIFSKGGGQITYKWPPN DRPSTRA- 



184 ERPIGFFCDCTFSAYTGPFCSNEISAYFGS--GSSVIYNFQENYLLSKNSSSHAASFHGD 
980 990 1000 1010 1020 1030 



1150 1160 1170 1180 1190 
gi|141 DRLAIGFS--TVQKEAVLVRVDSSSGLGDYLELHI-HQGKIGVKFNVGT DDIA
gi|184 MKLSREMIKFSFRTTRTPSLLLFV--SSFYKEYLSVIIAKNGSLQIRYKLNKYQEPDVVN
1040 1050 1060 1070 1080 1090 

1200 1210 1220 1230 1240 1250 

gi 1 141 IEESNAI INDGKYHWRFTRSGGNATLQVDSWPVI ERYPAGRQLTI FNSQATI I IGGKEQ 



gi 1 184 FDFKN--MADGQLHHIMINREEGWFIEIDDNRRRQVHLSSGTEFSAVKSLVLGRILEHS 
1100 1110 1120 1130 1140 1150 



1477 residues in 1 query sequences 
1311 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 16:03:13 2002 done: Mon Dec 16 16:03:14 2002
Scan time: 0.017 Display time: 1.850

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank

version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAAsxa4Wh: 1642 aa

>gi|21166380|ref|NP_620060.1| neurexin 2, isoform alpha-2 precursor; neurexin II [H

vs /tmp/fastaHAAtxa4Wh library
searching /tmp/fastaHAAtxa4Wh library

1331 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 41, opt: 29, gap-pen: -12/ -2, width: 16

Scan time: 0.066
The best scores are: opt
gi|6694278|gb|AAF25199.1|AF193613_1 cell recognit (1331) 326

>>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition m (1331 aa)

initn: 434 initl: 185 opt: 326 
Smith-Waterman score: 807; 23.942% identity in 1111 aa overlap (265-1285:187-1228) 

240 250 260 270 280 290 

gi | 211 GFGGKFCSEEEHPMEGPAHLTLNSEGKEEFVATFKGNEFFCYDLSHNPIQSSTDEITLAF 



gi | 669 RYVRIVPLDWNGEGRIGLRIEVYGCSYWADVINFDGHWL^ 

160 170 180 190 200 210 

300 310 320 330 340 

gi | 211 RTLQRNGLMLH-TGKSADYVNLSLKSGAVWLV^ 

.: . .:..:: • • • • 

gi | 669 KTSESEGVILHGEGQQGDYITLELKKAKLVliSLNLGSNQLGPIYGHTSVMTGSLLDDHHW 
220 230 240 250 260 270 

350 360 370 380 390 400 

gi | 211 HDVRVTRNLRQHAGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYIGGSPNTADLPG 



gi | 669 HSWIERQGRS INLTLDRSMQHF-RTNGEFDYLDLDYEITFGGIPFSGK-PS 

280 290 300 310 320 

410 420 430 440 450 460 

gi | 211 SPVSNNFMGCLKDVVYKNNDFKLELSRLAKEGDPKMKLQGDLSFRCEDVAA^ 



gi | 669 SSSRKNFKGCMESINYNGVT^IT-DLAR-RKKLEPSW--GNLSFSCVEPYTV-PVFFNAT 
330 340 . 350 360 370 380 

470 480 490 500 510 520 

gi | 211 EAFVALPRWSAKRTGSISLDFRTTEPNGLLLFSQGRRAGGGAGSHSSAQRADYFAMELLD 



:.:..::: .:::::.::. ... - ... ... . 



gi | 669 -SYLEVPGRLNQDLFSVSFQFRTWNPNGLLVFSHFADNLGNVEIDLTESKVGVH-INITQ 

390 400 410 420 430 

530 540 550 560 570 580 

gi | 211 GHL YLLLDMG S GG I KLRAS S RKVNDGEWC HVDFQRDGRKGS I S VNS RS T PFLATGD S E I L 



* « * * • • * 



gi | 669 TKMSQI-DISSGS GLNDGQWHEVRFLAKENFAI LTI DGDEAS AVRTNS PLQV 

440 450 460 470 480 490 

590 600 610 620 630 640 

gi|211 DLESELYLGGLPEGGRVDLPLPPEVWTAALRAGYVGCVRDLFIDGRSRDLRGLAEAQ-GA
gi|669 KTGEKYFFGGF-------LNQMNNSSHSVLQPSFQGCMQLIQVDDQLVNLYEVAQRKPGS
500 510 520 530 540






650 660 670 680 690 700 
211 VGVAPFCSRETLKQCASAPCRNGGVCREGWNRFICDCIGTGFLGRVCER EATVLSY- 



669 FANVS I DMC AI I DRCVPNHC EHGGKC SQTWDS FKCTCDETGYSGATCHNS I YEPSC EAYK 
550 560 570 580 590 600 



211 



•DGS 



710 720 
• MYMK I ML PNAMHTEAED VS LRF 



730 

— MSQR 



* • * 



669 HLGQTSNYYWIDPDGSGPLGPLKVYCl^TEDKVWTIVSHDLQMQTPVVGYNPEKYSVTQL 
610 620 630 640 650 660 

740 750 760 770 780 

211 AYGLMMATTS - -RES ADTLRLELDG - GQMKLTVNLGKG - PETLFAGHKLNDNEWHTVRW 

. : . : : . : : . . . . ... . . 

669 VYSASMDQISAITDSAEYCEQWSYFCKMSRLLNTPDGSPYTWWVG-KANEKHYYWGG-- 

670 680 690 700 710 720 

790 800 810 820 830 840 

211 RRGKSLQLSVDNVTVEGQMAGAHMRLEFHNIETGIMTERRFISWPSNFIGHL--SGLVF 



669 -SGPGIQKCA- -CGIERNCTDPKY- - - YCNCDADYKQWRKDAGFL- -SYKDHLPVSQVW 

730 740 750 760 770 

850 860 870 880 890 900 

211 NGQPYMDQCKDGDITYCELNARFGLRAIVAD^ 



■ * 



669 GDTDR--QGSEAKLSVGPLRCQ-GDRNYW-NAASFPNPSSYLHFSTFQGETSADISFYFK 

780 790 800 810 820 

910 920 930 940 950 960 

211 TTAPDGLLLFNSGNGNDFIVIELVKGY-IHWFDLGNGPSLMKGNSDKPVNDNQWHNVVV 



669 TLTPWGVFLENMGK-EDFIKLELKSATEVSFSFDVGNGPVEIWRSPTPLNDDQWHRVTA 
830 840 850 860 870 880 

970 980 990 1000 1010 1020 

211 SRD--PGNVHTLKIDSRTVTQHSNGAIl^DLKGELYIGGLSKNMFSNLPKLVASRDGFQG 



669 ERNVKQASLQVDRLPQQIRKAPTEGHTRLELYSQLFVGG- 
890 900 910 . 920 



-AGGQQGFLG 
930 



1030 1040 1050 1060 1070 1080 

211 CLASVDLNGRLPDLIADALHRIGQVERGCDGPSTTCTEESCANQGVCLQQWDGFTCDCTM 



669 CIRSLRMNGVTLDLEERAKVTSGFIS-GCSGHCTSYGT-NCENGGKCLERYHGYSCDCSN 
940 950 960 970 980 990 



1090 1100 1110 

211 T S YGG PVCN - D PGTT Y I FGKGGAL I T YTW - - P 



1120 
P NDRPSTRMDR 



669 TAYDGTFCNKDVGA FFEEGMWLRYNFQAPATNARDSSSRVDNAPDQQNSHPDLAQEE 

1000 1010 1020 1030 1040 1050 



1130 1140 1150 1160 1170 1180 

gi|211 LAVGFSTHQRSAVLVRVDSASGLGDYLQLHIDQ-GTVGVIFNVG--TDDITIDEPNAIVS
gi|669 IRFSFSTTKAPCILLYISSFTT--DFLAVLVKPTGSLQIRYNLGGTREPYNIDVDHRNMA

1060 1070 1080 1090 1100 

1190 1200 1210 1220 1230 

211 DGKYHWRFTRSGGNATLQVDSWP-VNERYPAGRQLTIFNSQAAIKIG GR-DQ-- 

* * .:.::: . . . : : . : : 

... . . ■ . • . . ■* • « « 

669 NGQPHSVNITRHEKTIFLKLDHYPSVSYHLPSSSD-TLFNSPKSLFLGKVIETGKIDQEI 
1110 1120 1130 . 1140 1150. 116-0 

1240 1250 1260 1270 1280 

211 GRP-FQGQVSGLYYNGLKVLALA— -AESDPNVRTEGHLRLVGEGPSVL-LSAETTA 



669 HKYNTPGFTGCLSRVQFNQIAPLKAALRQTNASAHVHIQGELVESNCGASPLTLSPMSSA 
1170 1180 1190 1200 1210 1220 

1290 1300 1310 1320 1330 1340 

211 TTLLADMATTIMETTTTMATTTTRRGRSPTLRDSTTQNTDDLLVASAECPSDDEDLEECE 

* 
* 

669 TDPWHLDHLDS AS ADFP YNPGQGQAIRNGVNRNS AI I GGVI AWI FT I LCTLVFL I RYMF 
1230 1240 1250 1260 1270 1280 



1642 residues in 1 query sequences 
1331 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:51:21 2002 done: Mon Dec 16 15:51:22 2002 
Scan time: 0.066 Display time: 2.117 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAuHaq6z: 1642 aa
>gi|21166380|ref|NP_620060.1| neurexin 2, isoform alpha-2 precursor; neurexin II [H

vs /tmp/fastaDAAvHaq6z library
searching /tmp/fastaDAAvHaq6z library

1154 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 41, opt: 29, gap-pen: -12/ -2, width: 16

Scan time: 0.050
The best scores are: opt
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 262

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)

initn: 372 initl: 151 opt: 262 
Smith-Waterman score: 660; 25.153% identity in 982 aa overlap (265-1202:183-986) 

240 250 260 270 280 290 

gi | 211 GFGGKFCSEEEHPMEGPAHLTLNSEGKEEFVATFKGNEFFCYDLSHNPIQSSTDEITLAF 



* ■ 



gi 1 163 RFLRFLPLAWNPRGRIGMRIEVYGCAYKSEWYFDGQSALLYRLDKKPLKPIRDVISLKF 

160 170 180 190 200 210 

300 310 .320 330 340 

gi | 211 RTLQRNGLMLHT-GKSADYVNLSLKSGA 



* * 



gi 1 163 KAMQSNGILLHREGQHGNHITLELIKGKLVFFLNSGNAKLPSTIAPVTLTLGSLLDDQHW 

220 230 240 250 260 270 

350 360 370 380 390 400 

gi | 211 HDVRWRNLRQHAGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYIGGSPNTADLPG 



•* . .*•■■•.• . .... 

.... • ••••«»•■ 



gi 1 163 HSVLIE LLDTQVNFTVDK-HTHHFQAKGDSSYLDLNFEISFGGIPT PG 

280 290 300 310 

410 420 430 440 450 460 

gi | 211 SPVS---NNFMGCLKDVVYKNNDFKLELSRLAKEGDPKMKLQGDLSFRCEDVAALD 



• * * 



gi 1 163 RS RAFRRKS FHGC L ENL YYNGVD VTELAKKHKPQILMMGNVSFSCPQPQTV-PVTF 

- 320 330 340 350 360 370 

470 480 490 500 510 520 

gi | 211 ESPEAFVALPRWSAKRTGSISLDFRTTEPNGLLLFSQGRRAGGGAGSHSSAQRADYFAME 



gi 1 163 LSSRSYLALPGNSGEDKVSVTFQFRTWNRAGHLLFGELRR GSGS FVLF 

380 390 400 410 420 

530 540 550 560 570 580 

gi | 211 LLDGHLYL-LLDMGSGGIKLRASSRKVNDGEWCHVDFQRDGRKGSISVNSRST--PFLAT 



* . * ... . 



... • • • 

• a... ... 



gi 1 163 LKDGKLKLSLFQPGQSPRNVTAGA-GLNDGQWHSVSFSAK^ 

430 440 450 460 470 480 

590 600 610 620 630 640 

gi | 211 GD S E I LDLE S EL YLGGL P EGGRVDL PL P PEVWTAALRAGYVGCVRDL F I DGRS RDLRGLA 
gi|163 --



— LIDSGDTYYFGD- 
490 



-AAWTWQHGGPDAVTLRGAPSGHPRSAVSFA 
500 510 520 






650 660 670 680 690 

211 EAQGA VGVAPFCSRETLKQCASA--PCRNGGVCREGWNRFICDCIGTGFLGRVC 



163 YAAGAGQLRSAVNLAERCEQRLALRCGTARRPDSRDGTPLSWW- 
•■' 530 540 550 560 



-VGRTN 
570 



700 710 720 730 740 750 

211 EREATVLSYDGSMYMKIMLPNAMHTEAEDVSLRFMSQRAYGLMMATTSRESADTLRLELD 



163 E THTYWGGS 

580 



LPDAQKCTC 
590 



■GL- 



EGNCIDS 



760 770 780 790 800 810 

211 GGQMKLTVNLGKGPETLFAGHKLNDNEWHTVRW 



* * 



163 --QYYCNCDAGR- 
600 



— NEWTSDTIVLSQKE-HLPVTQI 
610 620 630 



820 830 840 850 860 870 

211 EFHNIETGIMTERRFISWPSNFIGHLSGLVFNGQPYMDQCKDGDITYCELNARFGLRAI 



163 VMTD TGQPH S EADYTLG PLLC R - GDQS F 

640 650 

, 880 890 900 910 920 930 

211. VAQPVTFKSRS S YLALATLQAYASMHLFFQFKTTAPDGLLLFNSGNGNDFI VI EL - VKGY 

. ..:....::: : : : 

163 W-NSASFNTETSYLHFPAFHGELTADVCFFFKTTVSSGVFMENLGI-TDFIRIELRAPTE 

660 670 680 690 700 710 

940 950 960 970 980 990 
211 IHYVFDLGNGPSLMKGNSDKPVNDNQWHNVWSRDPGNVHTLKID S RTVTQH SNGAR 

. . ::.:::: . . : :::::::.:.:. ...:..: . . - . : 

163 WFSFDVGNGPCEVTVQSPTPFNDNQWHHVRAERNVKGA-SLQVDQLPQKMQPAPADGHV 

720 730 740 750 760 770 

1000 1010 1020 1030 1040 1050 

211 NLDLKGELYIGGLSKNMFSNLPKLVASRD-GFQGCIiASVDLNGRLPDLIADALHRIGQVE 



163 RLQLNSQLFIGG TATRQRGFLGCIRSLQLNGVALDLEERATVTPG-VE 

780 790 . 800 . 810 820 

1060 1070 1080 1090 1100 

211 RGCDGPSTTCTEESCANQGVCLQQWDGFTCDCTMTSYGGPVCNDPGTTYIFGKGGALIT- 



* * 



* * 



* * 



« • * * 



163 PGC AGHC STYGHL - CRNGGRCREKRRGVTCDC AF S AYDGPFC SNE I S AY- FATGS SMTYH 

830 840 850 860 870 



1110 
211 YTWPPN- 



1120 1130 1140 1150 

DRPSTRMDRLAVGFSTHQRSAVLVRVDSASGLGDYLQLHI 



163 FQEHYTLSENSSSLVSSLHRDVTLTR-EMITLSFRTTRTPSLLLYVSSF--YEEYLSVIL 
880 890 900 910 920 930 



1160 1170 1180 1190 1200 1210 

gi | 211 -DQGTVGVIFNV GTDD I T I DE PNA I VSDGK YHWRFTRS GGNATLQVD SW PVNER Y 
gi | 163 ANNGSLQIRYKLDRHQNPDAFTFDFKN--MADGQLHQVKINREEAVVMVEWQSTKKQVI
940 950 960 970 980 990 

1220 1230 1240 1250 1260 1270 

gi | 211 PAGRQLTIFNSQAAIKIGGRDQGRPFQGQVSGLYYNGLKVLALAAESDP1WRTEGHLRLV 

gi 1 163 LSSGTEFNA\^SLIIX?KVLEAAGADPDTRRAATSGFTGCLSAVRFGRAAPLKAALRPSGP 
-1000 1010 .1020 1030 1040 1050 



1642 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:53:54 2002 done: Mon Dec 16 15:53:55 2002 
Scan time: 0.050 Display time: 1.700 

Function used was FASTA 
version 3.3t05 March 30, 2000
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 
/tmp/fastaGAAe9aWNs: 1642 aa
>gi|21166380|ref|NP_620060.1| neurexin 2, isoform alpha-2 precursor; neurexin II [H
vs /tmp/fastaHAAf9aWNs library
searching /tmp/fastaHAAf9aWNs library

1311 residues in 1 sequences



FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 41, opt: 29, gap-pen: -12/ -2, width: 16 

Scan time: 0.050 
The best scores are: opt
gi|18496979|ref|NP_207837.1| cell recognition pro (1311) 523

>>gi|18496979|ref|NP_207837.1| cell recognition protein (1311 aa)

initn: 483 initl: 170 opt: 523 
Smith-Waterman score: 877; 25.980% identity in 1020 aa overlap (265-1202:186-1123) 






240 250 260 270 280 290 

211 GFGGKFCSEEEHPMEGPAHLTLNSEGKEEFVATFKGNEFFCYDLSHNPIQSSTDEITLAF 



184 RFLRFIPLEWNPKGRIGMRIEVFGCAYRSEWDLDGKSSLLYRFDQKSLSPIKDIISLKF 
160 170 180 190 200 210 

300 310 320 330 340 
211 RTLQRNGLMLHT-GKSADYVNLSLKSGAVWLVINLGSGAFEALVEPVN. GKF-NDNAW 



184 KTMQSDGILLHREGPNGDHITLQLRRARLFLLINSGEAKLPSTSTLVNLTLGSLLDDQHW 
220 230 240 250 260 270 

350 360 370 380 390 400 

211 HDVRVTRNLRQHAGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYIGGSPNTADLPG 



• ■ * * * 



184 HSVLIQRLGKQ- 
280 



■VNFTVDE - HRHHFHARGEFNLMNLD YE I S FGG I PA PG 

290 300 310 320 



410 420 430 440 450 460 

211 SPVS---NNFMGCLKDVWKNNDFKLELSRLAKEGDPKMKLQGDLSFRCEDVAALDPVTF 



* • 



■ # * 



184 KSVSFPHRNFHGCLENLYYNGVD 1 IDLAKQQKPQI IAMGNVSFSC SQPQSM- PVTF 

330 340 .350. 360 370 

470 480 490 500 510 520 

211 ESPEAFVALPRWSAKRTGSISLDFRTTEPNGLLLFSQGRRAGGGAGSHSSAQRADYFAME 



* • ■ * • 



184 LSSRSYLALPDFSGEEEVSATFQFRTWNKAGLLLFSELQLISGG 
380 390 400 410 420 



ILLF 



530 540 550 560 570 580 

211 LLDGHLYL-LLDMGSGGIKLRASSRKVNDGEWCHVDFQRDGRKGSISVNSR STPFLA 



* • * 



* * 



184 LSDGKLKSNLYQPGKLPSDITAGV-ELNDGQWHSVSLSAKKNHLSVAVDGQMASAAPLL 
430 440 450 460 470 480 



590 600 610 620 630 

gi | 211 TGDSEILDLESELYLGGLPE GGRVDLPLPPEVWTAALRAGYVGCVRDLFIDGRSRDL 
gi | 184 -GPEQIYS-GGTYYFGGCPDKSFGSKCKSPL

490 500 510 



■GGFQGCMRLI S I SGKWDL 
520 530 



690 






640 650 660 670 680 

211 RGLAEAQGAVGVAPFCSRETL— KQCASAPCRNGGVCREGWNRFICDCIGTGFLGRVCE 



* * 



184 — ISVQQGSLGNFSDLQIDSCGISDRCLPNYCEHGGECSQSWSTFHCNCTNTGYRGATCH 

540 . 550 560 570 580 

700 710 720 730 740 

211 REATVLSY DGSMYMKIMLPNAMHTEAEDVSLRFMSQRAYGLM-MAT



* * 



184 NSIYEQSCEAYKHRGNTSGFYYIDSDGSGPLEPFLLYCNMTETAWTIIQHNGSDLTRVRN 
590 600 610 620 630 640 

750 760 770 . 780 
211 TSRES- -ADTLRLELDGGQMKLTVNLGKGPETLFAGH KLNDNE WHTVRW 

184 TNPENPYAGFFEYVASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTN 

660 670 680 690 700 



650 



790 800 810 

211 RR GKS LQL S VDNV TVEGQMAGAHM - 



820 830 
RLEFHNIETGIMTERRFISWPSN 



* * 



184 ETQTYWGGSSPDLQKCTCGLEGNCIDSQYYCNCDADRNEWTN-DTGLLAYKEHLPV--TK 

720 730 740 750 760 



710 



890 



840 850 860 870 880 

211 FIGHLSGLVFNGQPY^QCKDGDITYCELNARFGLRAIVADPVTFKSRSSYLALATLQAY 



• • • 



184 IVITDTGRLHSEAAY KLGPLL-CRGDRSFW NSASFDTEASYLHFPTFHGE 

780 790 800 810 



770 



950 



900 910 920 930 940 

211 ASMHLFFQFKTTAPDGLLLFNSGNGNDFIVIELVKG-YIHYVFDLGNGPSLMKGNSDKPV 



* * 



184 LSADVSFFFKTTASSGVFLENLGIA-DFIRIELRSPTWTFSFDVGNGPFEISVQSPTHF 

830 840 850 860 870 



820 



1010 



960 970 980 990 1000 

211 NDNQWHNVWSRDPGNVHTLKIDS - - - RTVTQH SNG ARNLDLKGEL Y I GGL S KNMF SNL P 



184 NDNQWHHVRVERNMKEA-SLQVDQLTPKTQPAPADGHVLLQLNSQLFVGG- 
880 .890 900 910 9.20 



1070 



1020 1030 1040 1050 1060 

211 KLVASRD-GFQGCLASVDLNGRLPDLIADALHRIGQVERGCDGPSTTCTEESCANQGVCL 



• * • • 



• * 



184 — TATRQRGFLGCIRSLQLNGMTLDLEERA-QVTPEVQPGCRGHCSSYGKL-CRNGGKCR 

930 940 950 960 970 

1080 1090 1100 1110 1120 
211 QQWDGFTCDCTMTSYGGPVCNDPGTTYIFGKGGALITYTWPPN— DRPST 

: : ::::...: : : : - - • • : : • • : • • : • 

184 ERPIGFFCDCTFSAYTGPFCSNEISAY-FGSGSSVI-YNFQENYLLSKNSSSHAASFHGD 

990 1000 1010 1020 1030 



980 



1170 



1130 1140 1150 1160 
gi | 211 -RMDRLAVGFS--THQRSAVLVRVDSASGLGDYLQLHIDQ-GTVGVIFNVGT DDIT 
.... . . . . • . . . . . . .

gi 1 184 MKLSREMIKFSFRTTRTPSLLLFVSSF--YKEYLSVIIAKNGSLQIRYKLNKYQEPDVVN 
1040 1050 1060 1070 1080 1090 

1180 1190 1200 1210 1220 1230 

gi | 211 IDEPNAIVSDGKYHWRFTRSGGNATLQVDSWPVNERYPAGRQLTIFNSQAAIKIGGRDQ 

. • » » • • 

. . * ••••• • • • » ••••• 

gi 1 184 FDFKN--MADGQLHHIMINREEGWFIEIDDNRRRQVHLSSGTEFSAVKSLVLGRILEHS 
-1100 1110 1120 1130 1140 . 1150 



1642 residues in 1 query sequences 
1311 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:54:59 2002 done: Mon Dec 16 15:55:00 2002 
Scan time: 0.050 Display time: 1.917 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaKAAwxa4Wh: 1392 aa 

>gi|23498650|emb|CAC87720.2| neurexin 3-alpha [Homo sapiens]

vs /tmp/fastaLAAxxa4Wh library
searching /tmp/fastaLAAxxa4Wh library

1331 residues in 1 sequences



FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.034 
The best scores are: opt 
gi|6694278|gb|AAF25199.1|AF193613_1 cell recognit (1331) 429

>>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition m (1331 aa)

initn: 419 initl: 192 opt: 429 
Smith-Waterman score: 868; 26.006% identity in 1019 aa overlap (255-1196:186-1142) 






230 240 250 260 270 280 

234 STTGYGGKLCSEGLSHLMMSEQGRSKAREENVATFRGSEYLCYDLSQNPIQSSSDEITLS 



669 ARYVRIVPLDWNGEGRIGLRIEVYGCSYWADVINFDGHWLPYRFRNKKMKTLKDVIALN 
160 170 180 190 200 210 

290 .300 310 . 320 330 

234 FKTWQRNGLILH-TGKSADYVNLALKDGAVSLVINLGSGAFEAI - --VEPVNGKF-NDNA



669 FKTSESEGVILHGEGQQGDYITLELKKAKLVLSLNLGSNQLGPIYGHTSVMTGSLLDDHH 
220 230 240 250 260 270 

340 350 360 370 380 390 

234 WHDVKVTRNLRQVTISVDGILTTTGYTQEDYTMLGSDDFFYVGGSPSTADLPGSPVSNNF 



669 WHSWIERQGRSINLTLDRSMQHF-RTNGEFDYLDLDYEITFGGIPFSGK-PSSSSRKNF 
280 290 300 310 320 330 

400 410 420 430 440 450 

234 MGCLKEWYKNNDIRLELSRLARIADTKMKIYGEWFKCENVATLDPINFETPEAYISLP 



669 KGCMESINYNGVNIT-DLARRKKLEPSNV GNLSFSCVEPYTV-PVFFNAT-SYLEVP 

340 350 . 360 370 380 

460 470 480 490 500 510 

234 KWNTKRMGSISFDFRTTEPNGLILFTHGKPQERKDARSQKNTKVDFFAVELLDGNLYLLL 



669 GRLNQDLFSVSFQFRTWNPNGLLVFSHFADNLGNVEIDLTESKVGVH-INITQTKMSQI- 
• 390 400 410 420 430 440 

520 530 540 550 560 570 

234 DMGSGTIKVKATQKKANDGEWYHVDIQRDGRSGTISVNSRRTPFTASGESEILDLEGDMY 



• . ■ • 



669 DISSGS GLNDGQWHEVRFLAKENFAILTIDGDEASAVRTNSPLQVKTGEKYF 

450 460 470 480 490 



580 590 600 610 620 630 

gi | 234 LGGLPENRAGLILPTELWTAMLNYGWGCIRDLFIDGRSKNIRQLAEMQNAAGVKSSCSR 
gi|669 FGGF
500 



•LNQMNNSSHSVLQPSFQGCMQLIQVDDQLVNLYEVAQRKPGSFANVSID 
510 520 530 540 550 






640 650 660 670 680 
234 MSA--KQCDSYPCKNNAVCKDGWNRFICDCTGTGYWGRTCE REASILSY- 



669 MCAIIDRCVPNHCEHGGKCSQTWDSFKCTCDETGYSGATCHNSIYEPSCEAYKHLGQTSN 

560 570 580 590 600 610 

690 700 710 720 

234 DGS MYMKI IMPMVMHTEAEDVSFRFMSQRAYGL- -LVATTSRD 



669 YYWIDPDGSGPLGPLKVYCNMTEDKVWTIVSHDLQMQTPVVGYNPEKYSVTQLVYSASMD 

620 630 640 650 660 670 

730 740 750 760 770 

234 - - SADTLRLELDGGRV KL- -MVNLGKG- PETLYAGQKLNDNEWHTVRWRRGKSLK 

: : : : : : . . . : :::..:. : : : : . . . 

669 QISAITDSAEYCEQYVSYFCKMSRLLNTPDGSPYTWWVGKA NEKH-YYWGGSGPGIQ 

680 690 700 710 720 

780 790 800 810 820 830 

234 LTVDDDVAEGTMVGDHTRLEFHNIETGIMTEKRYISWPSSFIGHLQSLMFNGLLYIDLC 



669 



KCACGIERNCTDPKYYCNCDADYKQWRK- 
730 740 750 



DAGFLSYKDHLPVSQVWGDTD 
760 770 



840 850 860 . 870 880 890 

234- KNGD- — IDYCELKARFGLRNIIADPVTFKTKSSYLSLATLQAYTSMHLFFQFKTTSPDG 



669 RQGSEAKLSVGPLRCQ-GDRNYW-NAASFPNPSSYLHFSTFQGETSADISFYFKTLTPWG 
780 790 800 810 820 830 

900 910 920 930 940 950 

234 FILFNSGDGNDFIAVELVKGY-IHYVFDLGNGPNVIKGNSDRPLNDNQWHNVVITRDNSN 



669 VFLENMGK-EDFIKLELKSATEVSFSFDVGNGPVEIWRSPTPLNDDQWHRVTAER-NVK 
840 850 860 870 880 890 

960 970 980 990 1000 1010 

234 THSLKVD TKVVTQVINGAKNLDLKGDLYMAGLAQGMYSNLPKLVASRDGFQGCLASV 



* * * • 



* * 



669 QASLQVDRLPQQIRKAPTEGHTRLELYSQLFVGG-AGG 

900 910 . • 920 



— QQGFLGCIRSL 
930. 940 



1020 1030 1040 1050 1060 1070 

234 DLNGRLPDLINDALHRSGQIERGCEGPSTTCQEDSCANQGVCMQQWEGFTCDCSMTSYSG 



* * 



• ■ • 



669 RMNGVTLDLEERAKVTSGFIS-GCSGHCTSYGTN-CENGGKCLERYHGYSCDCSNTAYDG 

950 960 970 980 990 

1080 1090 1100 1110 

234 NQCN-DPGATYIFGKSGGLILYTW- -PA NDRPSTRSDRLAVGF 



• * 



***** •**• 



669 TFCNKDVGA FFEEGMWLRYNFQAPATNARDSSSRVDNAPDQQNSHPDLAQEEIRFSF 

1000 1010 1020 1030 1040 1050 



1120 1130 1140 1150 1160 

gi | 234 STTVKDGILVRIDSAPGLGDFLQLHIEQ-GKIGWFNIGTV— DISIKEERTPVNDGKYH 
gi | 669 STTKAPCILLYISSFTT--DFLAVLVKPTGSLQIRYNLGGTREPYNIDVDHRNMANGQPH
1060 ' 1070 1080 1090 1100 1110 



1170 1180 1190 1200 1210 1220 

gi | 234 WRFTRNGGNATLQVDNWP-VNEHYPTGNTDNERFQ 



gi | 669 SVNITRHEKTIFLKLDHYPSVSYHLPSSSDTLFNSPKSLFLGKVIETGKIDQEIHKYNTP 
. 1120 1130 1140 1150 . 1160 1170 



1392 residues in 1 query sequences 
1331 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:59:10 2002 done: Mon Dec 16 15:59:11 2002 
Scan time: 0.034 Display time: 1.833 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000
Please cite:

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAC7ai6z: 1392 aa 
>gi|23498650|emb|CAC87720.2| neurexin 3-alpha [Homo sapiens] 

vs /tmp/fastaDAAD7ai6z library 
searching /tmp/fastaDAAD7ai6z library

1154 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.050 
The best scores are: opt 
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 240

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)

initn: 327 initl: 166 opt: 240 
Smith-Waterman score: 616; 24.612% identity in 967 aa overlap (251-1184:178-986) 






230 240 250 260 270 280 

234 TCDCSTTGYGGKLCSEGLSHLMMSEQGRSKAREENVATFRGSEYLCYDLSQNPIQSSSDE 



163 PPFEARFLRFLPLAWNPRGRIGMRIEVYGCAYKSEVVYFDGQSALLYRLDKKPLKPIRDV 
150 160 170 180 190 200 

. 290 300 310 320 330. 

234 ITLSFKTWQRNGLILHT-GKSADYVNLALKDGAVSLVINLGSGAFEAIVEPW GKF- 
... ... • • • • .::*...■••■ ... 

163 ISLKF^QSNGILLHREGQHGNHITLELIKGKLVFFLNSGNAKLPSTIAPVTLTLGSLL 

220 230 240 250 260 



210 



390 



340 350 360 370 380 

234 NDNAWHDVKWRNLRQVTISVDGILTTTGYTQEDYTMLGSDDFFYVGGSPSTADLPGSPV 

. . ... * • 

.:.::.:. ::...:: : ..:... . . .... 

163 DDQHWHSVLIELLDTQVNFTVDK-HTHHFQAKGDSSYLDLNFEISFGGIPT PGRSR 

280 290 300 310 320 



270 



450 



400 410 420 430 440 

234 S NNFMGCLKEVVYKNNDIRLELSRLARIADTKMKIYGEVVFKCENVATLDPINFETP 



163 AFRRKSFHGCLENLYYNGVDV TELAKKHKPQILMMGNVSFSCPQPQTV-PVTFLSS 

. 330 340 350 360 370 

460 470 480 490 500 510 

234 EAYISLPKWNTKRMGSISFDFRTTEPNGLILFTHGKPQERKDARSQKNTKVDFFAVELLD 



• • • » 



163 RSYLALPGNSGEDKVSVTFQFRTWNRAGHLLFG ELRRGSGS 

380 390 400 410 



-FVLFLKD 
420 



520 530 540 550 560 

234 GNLYL-LLDMGSGTIKVKATQKKANDGEWYHVDIQRDGRSGTISVNSRRT--PFTASGES 



* m * * • 



163 GKLKLSLFQPGQSPRNVTAGAG-LNDGQWHSVSFSAKWSHMNVWDDDTAVQPLVA- 

440 450 460 470 480 



430 



600 



610 



620 



570 580 590 

gi | 234 EILDLEGDMYLGGLPENRAGLILPTELWTAMLNYGYVGCIRDLFIDGRSKNIRQLAEMQN 
gi|163 -VLIDSGDTYYFG

490 



DAAWTWQHGGPDAVTLRGAPSGHPRSAVSFAYAAG 
500 510 520 



680 






630 640 650 660 670 

234 AAGVKSSCSRMSAKQCDSYPCKNNAVCKDGWNRFICDCTGTGYW-GRTCEREASILSYDG 



163 AGQLRSAVNL— AERCEQRLALRCGTARRPDSR- - -DGTPLSWWVGRTNETHTY- - -WGG 

550 560, 570 . 580 



530 



540 



730 



740 



690 700 710 720 

234 SMYMKI IMPMVMHTEAEDVSFRFMSQRAYGLLVATTSRDSADTLRLELDGGRVKLMVNLG 



163 SL- 



DAQKCTC 
590 



GL - - EGNC I D S Q - - YYCNCDAGR- 

600 



800 



750 760 770 780 790 

234 KGPETLYAGQKLNDNEVfflTVRVVRRGKSLKLTVDDDVAEGTMVGD-HTRLEFHNIETGIM 



163 



— NEWTSDTIVLSQKE-HLPVTQIVMTDT— GQPHSEADYT- 
610 620 630 640 

860 



810 820 830 840 850 

234 TEKRYISWPSSFIGHLQSLMFNGLLYIDLCKNGDIDYCELKARFGLRNIIADPVTFKTK 



163 



LGPL- 



— LCR-GDQSF 
650 



WNSASFNTE 

660 

870 880 890 900 910 920 

234 SSYLSLATLQAYTSMHLFFQFKTTSPDGFILFNSGDGNDFIAVEL-VKGYIHYVFDLGNG 

• •••• • *; t ; ; ; , ; ; , ••■•>>>• 

163 TS^HFPAFHGELTADVCFFFKTWSSGVFMENLGI-TDFIRIELRAPTEVTFS 

690 700 710 720 



670 



680 



970 



980 



930 940 950 960 
234 PNVIKGNSDRPLNDNQWHNWITRDNSNTHSLKVDT KWTQVINGAKNLDLKGDLYM 

: . . : :.::::::.: : : . : : . : : : • • : 

163 PCEVTVQSPTPFNDNQWHHVRAER-NVKGASLQVDQLPQKMQPAPADGHVRLQLNSQLFI 

740 750 760 770 780 



730 



1040 



990 1000 1010 1020 1030 

234 AGLAQGMYSNLPKLVASRDGFQGCLASVDLNGRLPDLINDALHRSGQIERGCEGPSTTCQ 



• • • 



163 GGTA TRQRGFLGC I RSLQLNGVALDLEERATVTPG - VEPGC AGHC STYG 

800 810 820 830 



790 



1090 



1100 



1050 1060 1070 1080 

234 EDSCANQGVCMQQWEGFTCDCSMTSYSGNQCNDPGATYIFGKSGGLILYTWPANDRPSTR 



163 H-LCRNGGRCREKRRGVTCDCAFSAYDGPFCSNEISAYF— ATGSSMTYHFQEHYTLSEN 

850 860 870 880 



840 



1150 



1110 1120 1130 1140 

234 SDRLAVGFS TTVKDGILV--RIDSAPGLGDFLQLHIEQGKIGVVFNIGTVDISIKEE 



* * 



* * 



■ » * 



163 SSSLVSSLHRDVTLTREMITLSFRTTRTPSLLLYVSSFYEEYLSVILANNGSLQIRYKLD 

910 920 930 940 



890 



900 



1160 

gi | 234 R— TP 



1200 



1170 1180 1190 
VNDGKYHWRFTRNGGNATLQVDNWPVNEHYPTGNTDNERFQMVKQ 
gi | 163 RHQNPDAFTFDFKNMADGQLHQVKINREEAWMVEVNQSTKKQVILSSGTEFNAVKSLIL
950 960 970 980 990 1000 

1210 1220 1230 1240 1250 1260 

gi | 234 KIPFKYNRPVEEWLQEKGRQLTIFNTQAQIAIGGKDKGRLFQGQLSGLYYDGLKVLNMAA 

gi 1 163 GKVLEAAGADPDTRRAATSGFTGCLSAVRFGRAAPLKAALRPSGPSRVTVRGHVAPMARC 
1010-\ 1020 1030 1040 .1050 1060 



1392 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:58:10 2002 done: Mon Dec 16 15:58:11 2002 
Scan time: 0.050 Display time: 1.567 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank

version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaKAAgTaWrn: 1392 aa 

>gi|23498650|emb|CAC87720.2| neurexin 3-alpha [Homo sapiens] 

vs /tmp/fastaLAAhTaWrn library 
searching /tmp/fastaLAAhTaWrn library

1311 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.066 
The best scores are: opt 
gi|18496979|ref|NP_207837.1| cell recognition pro (1311) 498

>>gi|18496979|ref|NP_207837.1| cell recognition protein (1311 aa)

initn: 303 initl: 163 opt: 498 
Smith-Waterman score: 862; 25.411% identity in 1035 aa overlap (251-1209:181-1145) 






230 240 250 260 270 280 

234 TCDCSTTGYGGKLCSEGLSHLMMSEQGRSKAREENVATFRGSEYLCXDLSQNPIQSSSDE 



184 PSIKARFLRFIPLEWNPKGRIGMRIEVFGCAYRSEWDLDGKSSLLYRFDQKSLSPIKDI 

160 170 180 190 200 210 

290 300 310 320 330 

234 IT^SFKTWQRNGLILHT-GKSADYVNLALKDGAVSLVINLGSGAFEAIVEPVN---GKF- 
.......... . .. .... .. * 

■ ••a... . ..*... ■ ••••••* .. . « •••• . . • • • » ... 

184 ISLKFKTMQSDGILLHREGPNGDHITLQLRRARLFLLINSGEAXLPSTSTLVNLTLGSLL 

220 230 240 250 260 270 

340 350 360 370 380 390 

234 NDNAWHDVKVTRNLRQVT I S VDG I LTTTG YTQED YTMLG SDDF F YVGG S P S TADL PG S P V 
. : . ::.: . : .::.*.:: ... *...*. . . .. .. ... . 

184 DDQHWHS\njIQRLGKQVNFTVDE-HRHHFHARGEFNLMNLDYEISFGGIPA PGKSV 

280 290 300 310 320 

400 410 420 430 440 450 

234 s NNFMGCLKEWYKNNDIRLELSRLARIADTKMKIYGEWFKCENVATLDPINFETP 



184 S F PHRNFHGC L ENL YYNGVD I - 1 DLAKQQK PQIIAMGNVSFSCSQPQSM-PVTFLSS 

330 .340 350 . 360 . 370 380 

460 470 480 490 500 510 

234 EAYISLPKWNTKRMGSISFDFRTTEPNGLILFTHGKPQERKDARSQKNTKVDFFAVELLD 



184 RSYLALPDFSGEEEVSATFQFRTWNKAGLLLFS 

390 400 410 



ELQLISGGILLFLSDGKLK 
420 430 



520 530 540 550 560 

234 GNLY LLLDMGSGTIKVKATQKKANDGEWYHVDIQRDGRSGTISVNSRRTPFTASGE 



184 SNL YQPGKL P S D I T AGV ■ 
440 



- -ELNDGQWHSVSLSAKKNHLSVAVDGQMASAAPLLG 
450 460 470 480 



570 580 590 600 610 620 

gi | 234 S E I LDL EGDMYLGGL P ENRAGL I L PTELWTAMLNYG YVGC I RDLF I DGRS KNI - - RQLAE 
gi | 184 PEQIYSGGTYYFGGCPDKSFGSKCKSPLG

490 500 510 



GFQGCMRLI S I SGKWDLI SVQQGS 
520 530 






630 640 650 660 670 680 

234 MQNAAGVK-SSCSRMSAKQCDSYPCKNNAVCKDGWNRFICDCTGTGYWGRTCE---REAS 



184 LGNFSDLQIDSCG-ISDRCLPNY-CEHGGECSQSWSTFHCNCTNTGYRGATCHNSIYEQS. 
540 550 560 - 570 .580 590 

690 700 710 720 730 

234 ILSY DGSMYMKI IMPMVMHTEAEDVSFRFMSQRAYGLL-VATTSRDSAD 



184 CEAYKHRGNTSGFYYIDSDGSGPLEPFLLYCNMTETAWTIIQHNGSDLTRVRNTNPENPY 
600 610 620 630 640 650 



740 750 760 
234 TLRLELDGGRVKLMVNLGKGP ETLYAGQK LNDNE- 



770 
-WHTVRWRR- 



184 AGFFEYVASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTNETQTYWG 
660 ■ 670 680 690 700 710 

780 790 800 810 820 

234 GKSLKLTVDDDVAEGTMVG DHTRLEFHNIETGIMTEKRYJSV--VPSSFIGHL 



• • • • 



• • ■ • 



184 GSSPDLQKCTCGLEGNCIDSQYYCNCDADRNEWTN-DTGLLAYKEHLPVTKIVITDTGRL 
720 730 740 750 760 770 

830 840 . 850 860 870 , 880 

234 QS^MFNGLLYIDLCKNGDIDYCELKARFGLRNIIADPVTFKTKSSYLSLATLQAYTSMHL 



184 HSEAAYKLGPL-LCR-GDRSF 
780 790 



-WNSASFDTEASYLHFPTFHGELSADV 
800 810 820 



890 900 910 920 930 940 

234 FFQFKTTSPDGFILFNSGDGNDFIAVELVKG-YIHYVFDLGNGPNVIKGNSDRPLNDNQW 



• • • ■ 



184 SFFFKTTASSGVFLENLGIA-DFIRIELRSPTWTFSFDVGNGPFEISVQSPTHFNDNQW 

830 840 850 860 870 

950 960 970 980 990 1000 

234 HNWITRDNSNTHSLKVD TKWTQVI NGAKNLDLKGDL YMAGLAQGMY SNL P KLVA S 



* • 



TR 



184 HHVRVER-NMKEASLQVDQLTPKTQPAPADGHVLLQLNSQLFVGGTA 

880 890 900 910 -920 

1010 1020 1030 1040 1050 1060 

234 RDGFQGCLASVDLNGRLPDLINDALHRSGQIERGCEGPSTTCQEDSCANQGVCMQQWEGF 



4 • 



• * 



4 ■ 



184 QRGFLGCIRSLQLNGMTLDLEERA-QVTPEVQPGCRGHCSSYGK-LCRNGGKCRERPIGF 
930 940 950 960 970 980 

1070 1080 1090 1100 1110 

234 TCDCSMTSYSGNQCNDPGATYIFGKSGGLILYTWPANDRPSTRSDRLAVGFSTTVKDG-- 

:::....:.: : . . . . : : : : : . . . : . . : : : • ' * • : • : • 
184 FCDCTFSAYTGPFCSNEISAY-FG-SGSSVIYNFQENYLLSKNSSSHAASFHGDMKLSRE 

990 1000 1010 1020 1030 1040 



1120 1130 1140 1150 1160 
gi | 234 ILVRIDSAPGLGDFLQLHIEQGKIGWFNIGTVDISIK EERTPVN-
gi | 184 MIKFSFRTTRTPSLLLFVSSFYKEYLSVIIAKNGSLQIRYKLNKYQEPDVVNFDFKNMAD

1050 1060 1070 1080 1090 1100 

1170 1180 1190 1200 1210 1220 

gi | 234 GKYHWRFTRNCK3NATLQVDNWPVNEHYPTC 



* * 



gi 1 184 GQLHHIMINREEGWFIEIDD NRRRQVHLSSGTEFSAVKSLVLGRILEHSDVDQETA 

' 1110 1120 1130 1140 1150 1160 

1230 1240 1250 1260 1270 1280 

gi | 234 GRQLTIFNTQAQIAIGGKDKGRLFQGQLSGLYYDGLKVLNMAAENNPNIKINGSVRLVGE 

gi 1 184 LAGAQGFTGCLSAVQLSHVAPLKAALHPSHPDPVTVTGHVTESSCMAQPGTDATSRERTH 

1170 1180 1190 1200 1210 1220 



1392 residues in 1 query sequences 
1311 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:57:28 2002 done: Mon Dec 16 15:57:29 2002 
Scan time: 0.066 Display time: 1.817 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAmhaaYv: 1345 aa 
>FIRST_SEQUENCE 

vs /tmp/fastaDAAnhaaYv library 
searching /tmp/fastaDAAnhaaYv library 

1331 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.050
The best scores are: opt
gi|6694278|gb|AAF25199.1|AF193613_1 cell recognit (1331) 4412

>>gi|6694278|gb|AAF25199.1|AF193613_1 cell recognition m (1331 aa)

initn: 3862 initl: 2022 opt: 4412
Smith-Waterman score: 4577; 50.724% identity in 1313 aa overlap (66-1345:30-1331) 

40 50 60 70 80 90 

FIRST_ I NL I KEMD S L PRLTS VLTLLF SGLWHLGLTATNYNC DD PLAS LL S PfclAF S S S S DLTGTH S 



* * 



gi I 669 MQAAPRAGCGAALLLWIVSSCLCRAWTAPSTSQKCDEPLVSGLPHVAFSSSSSISGSYS 

10 20 30 40 50 

100 110 120 130 140 150 

FIRST_ P--AQLNWRVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSPWVTSYSLMFSD 
* . » .-• «•«.. ». *:.:. :....■•.....•••••"*""•""*" -.«•.. 
gi I 669 PGYAKINKRGGAGGWSPSDSDHYQWLQVDFGNRKQISAIATQGRYSSSDWVTQYRMLYSD 
60 70 80 90 100 110 

160 170 180 190 200 210 

FIRST_ TGRNWKQYKQEDSIWTFAGNMNADSVVHHKLLHSVRARFVRFVPLEWNPSGKIGMRVEVY 



• • « ••• I.I 



gi I 669 TCRl^PYHQDGNIWAFPGNINSDGVVRHELQHPIIARYVRIVPLDWNGEGRIGLRIEVY 
120 130 140 150 160 170 

220 230 240 250 260 270 

FIRST_ GCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLE 



• a • • 



gi I 669 GCSYWADVINFDGHWLPYRFRNKKMKTLKDVIALNFKTSESEGVILHGEGQQGDYITLE 
.180 190 200 210 220 . 230 

280 290 300 310 320 330 

FIRST_ LQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQHWH-VLIERVGKQVNFTVDKHTQHF 



. ■•...* ••« • ••«.• • • * 

... • .« ••••• 



gi I 669 LKKAKLVLSLNLGSNQLGPIYGHTSVMTGSLLDDHHWHSWIERQGRSINLTLDRSMQHF 
240 250 260 270 280 290 

340 350 360 370 380 390 

FIRST. RTKGETDALDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII-LAKRRKHQIY 

• ...... * * 

. . •• . ... ••••• ...^^ .II. ••••••» .,.«•• . 

gi I 669 RTNGEFDYLDLDYEITFGGipFSGKPSSSSRKNFKG 

300 310 320 330 340 350 

400 410 420 430 440 450 

FIRST TVGNVTFSCSEPQIVPITFNSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEG 
gi | 669 WGNLSFSCVEPYTVPVFFNAT-SYLEVPGRLNQDLFSVSFQFRTWNPNGLLVFSHFADN
360 370 380 390 400 410 

460 470 480 490 500 

FIRST_ SGTLLLSL EGG I - LRLVI QKMTERVAE I LTG SNLNDGLWH S VS I NARRNR I TLTLDD 



gi | 669 LGWEIDLTESKVGVHINITQTKMSQ--IDISSGSGIJ^GQWHEVRFLAKENFAILTIDG 
42-0 430 440 450 460 470 

510 520 530 540 550 560 

FIRST. EAAPPAPDSTWVQIYSGNSYYFGGCPDNLTDSQ--CLNPIKAFQGCMRLIFIDNQPKDLI 

• • • •••• • ..•••**..•••• 

. . . •■ .*• ....... ... 

gi | 669 DEASAVRTNSPLQVKTGEKYFFGGFLNQMNNSSHSVLQP--SFQGCMQLIQVDDQLVNLY 
480 490 500 510 520 530 

570 580 590 600 610 620 

FIRST. SVQQGSLGNFSDLHIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSI 

. . ■ ••••• ••• • • • • • • "****;.. 

gi I 669 EVAQRKPGSFANVSIDMCAIIDRCVPNHCEHGGKCSQTWDSFKCTCDETGYSGATCHNSI 

540 550 560 . 570 580 590 

630 640 650 660 670 680 

FIRST_ YEQSCEVYRHQGNTAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGAN 



• ■ • • 



gi | 669 YEPSCEAYKHLGQTSNYl^IDPDGSGPLGPLKVYCI^TEDKVWTIVSHDLQMQTPWGTO 

600 610 620 630 640 650 

690 . 700 710 .720 730. 740 

FIRST. PEKPYAMALDYGGSMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTIPDGTPFTWWIGRSNER 



■ ■ • 



gi I 669 PEKYSVTQLVYSASMDQISAITDSAEYCEQYVSYFCKMSRLLNTPDGSPYTWWVGKANEK 

660 670 680 690 700 710 

750 760 770 780 790 800 

FIRST HPYWGGSPPGVQQCECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITD 



* * * ■ * * • 

***** * * 



gi | 669 HYYWGGSGPGIQKCACGIERNCTDPKYYCNCDADYKQWRKDAGFLSYKDHLPVSQVWGD 

720 730 740 750 760 770 

810 820 830 840 850 860 

FIRST TDRSNSEAAWRIGPLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSG 



* * * 
***** 



gi I 669 TDRQGSEAKLSVGPLRCQGDRNYWNAASFPNPSSYLHFSTFQGETSADISFYFKTLTPWG 
, 780 790 800 810. 820 830 

870 880 890 900 910 920 

FIRST_ VFLENLGIKDFIRLEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHYVRAERNLKET 



* * 



gi I 669 WLENMGKEDF I KLELKS ATEVS F SFDVGNGPVE I WRS PTPLNDDQWHRVTAERNVKQA 

840 850 860 870 880 890 

930 940 950 960 970 980 

FIRST. SLQVDNLPRSTRETSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERA 

*■•**•* • »■**••••• •* '••*■' 

***** •* • *■• * I * I ********** **••*••••••*••* ******* 

gi I 669 SLQWRLPQQIRKAPTEGHTRLELYSQLFVGGAGG-QQGFLGCIRSLRMNGVTLDLEERA 

900 910 920 .930 940 950 

990 1000 1010 1020 1030 1040 

FIRST. KVTSGVRPGCPGHCSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGT 
gi | 669 KVTSGFISGCSGHCTSYGTNCENGGKCLERYHGYSCDCSNTAYDGTFCNKDVGAFFEEGM

960 970 980 990 1000 1010 

1050 1060 1070 1080 1090 
FIRST. SVTYMFQEPYPVTKNISLSSSAIYTDSAPSKEN IALSFVTTQAPSLLLFIN 

. . . . . • • » • • 

. • . • • • • • ***••• ..•••••••»*•••• 

gi I 669 WIiRYNFQAP ATNARDS S S RV - - DNAPDQQNS H PDLAQE E IRFSFSTTKAPCI LLY I S 

• 1020 1030 1040 1050 1060 

1100 1110 1120 1130 1140 1150 

FIRST_ SSSQDFVWLLCKNGSLQVRYHLN-KEETHVFTIDADNFANRRMHHLKINREGRELTIQM 

• • • • • • * • • • » • 

gi | 669 S FTTDFLA^VKPTGSLQI RYNLGGTREP YNI DVDHRNMANGQPHSVNI TRHEKTI FLKL 
1070 1080 1090 1100 1110 1120 

1160 1170 1180 1190 1200 1210 

FIRST. DQQLRLS YNF — S PEVEFRVI RSLTLGKVTENLGLDS EVAKANAMGFAGCMS S VQYNHI A 



gi I 669 DHYPSVSYHLPSSSDTLFNSPKSLFLGKVIETGKIDQEIHKYNTPGFTGCLSRVQFNQIA 
1130 1140 1150 1160 1170 1180 

1220 1230 1240 1250 1260 
FIRST_ PLKAALRHATV-APVTVHGTLTESSCGF - -MVDSDVNAVTT -VHSSSDPFG-KTD 



• • • r r .... 



gi I 669 PLKAALRQTNASAHVHIQGELVESNCGASPLTLSPMSSATDPWHLDHLDSASADFPYNPG 
1190 1200 1210 1220 1230 1240 

1270 1280 1290 1300 1310 1320 

FIRST EREPLTNAVRSDSAVIGGVIAWIFIIFCIIGIMTRFLYQHKQSHRTSQMKEKEYPENLD 



# * 



gi I 669 QGQAIRNGVNRNSAIIGGVIAWIFTILCTLVFLIRYMFRHKGTYHTNEAKGAESAESAD 
1250 1260 1270 1280 1290 1300 

1330 1340 
FIRST_ SSF-RNEIDLQNTVSECKREYFI 



gi|669 AAIMNNDPNFTETIDESKKEWLI 
1310 1320 1330 



1345 residues in 1 query sequences 
1331 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000]

start: Mon Dec 16 15:40:18 2002 done: Mon Dec 16 15:40:19 2002 
Scan time: 0.050 Display time: 2.484 

Function used was FASTA 
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAAZ8aisO: 1345 aa 
>FIRST_SEQUENCE 

vs /tmp/fastaHAA08aisO library 
searching /tmp/fastaHAA08aisO library

1154 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.050 
The best scores are: opt
gi|16306509|ref|NP_387504.1| cell recognition mol (1154) 2301

>>gi|16306509|ref|NP_387504.1| cell recognition molecule (1154 aa)

initn: 3683 initl: 1805 opt: 2301 
Smith-Waterman score: 3850; 47.578% identity in 1280 aa overlap (51-1317:11-1152) 



30 40 50 60 70 80 

FIRST_ PTPBETAANDNEREXIl^IKEMDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLS 



gi|163 



MASVAWAVLKVLLLL PTQTW S PVGAGNP PDCDAPLAS ALP 
10 20 30 40 



90 100 . 110 120 130 

F I RST PMAF S S S SDLTGTHS P - - AQLNWRVGTGGWS PAD SNAQQWLQMDLGNRVE I TAVATQGRY 



.•■* ••....**•. ...» . ------- 

gi i 163 RSSFSSSSELSSSHGPGFSRLNRRDGAGGWTPLVSNKYQWLQIDLGERMEVTAVATQGGY 

50 60 70 80 90 100 



140 150 160 170 180 190 

FIRST GSSDWVTSYSLMFSDTGRNWKQYKQEDSIWTFAGl^ADSVVHHKLLHSVRARFVRFVPL 



; ; ; ; ; 

gi 1 163 GSSDWVTSYLLMFSDGGRNWKQYRREESIWGFPGNTNADSWHYRLQPPFEARFLRFLPL 

110 120 130 140 150 160 



200 210 220 230 240 250 

FIRST EWNPSGKIGMRVEVYGCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGV 



. ■*•*■*. *• 



gi 1 163 AWNPRGRIGMRI EVYGC AYKSEVVYFDGQSALLYRLDKKPLKPI RDVI SLKFKAMQSNGI 

170 180 . 190 200 210 220 



260 270 280 290 300 310 

FIRST_ LFHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSL-P-SATLGSLLDDQHWH-VLIE 



• *■••*> 



gi 1 163 LLHREGQHGNHITLELIKGKLVFFLNSGN--AKLPSTIAPVTLTLGSLLDDQHWHSVLIE 

230 240 250 260 270 



320 330 340 350 360 370 

FIRST. RVGKQVKFTVDKHTQHFRTKGETDALDIDYELSFGGI PVPGKPGTFLKKNFHGC I ENLYY 



i 1 163 LLDTQWFTVDKHTHHFQAKGDSSYLDLNFEISFGGIPTPGRSRAFRRKSFHGCLENLYY 
280 290 300 310 320 330 



380 390 400 410 420 430 

FIRST_ NGVNII-LAKRRKHQIYTVGNVTFSCSEPQIVPITFNSSGSYLLLPGTPQIDGLSVSFQF

gi|163 NGVDWEUUqCHKPQILMMGWSFSCPQPQWPVTFLSSRSYLALPGNSGEDKVSWFQF
340 350 360 370 380 390 

440 450 460 470 480 490 

FIRST_ RTWNKDGLLLSTELSEGSGTLLLSLEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSI 



* • 



gi 1 163 RTWNRAGHLLFGELRRGSGSFVLFLKDGKLKLSLFQPGQSPRNVTAGAGLNDGQWHSVSF 
40O 410 420 430 440 450 

500 510 520 530 540 550 

FIRST NARRNRITLTLDDEAAPPAPDSTWVQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRL 



gi 1 163 SAKWSHMNVWDDDTA- -VQPLVAVLIDSGDTYYFG 

460 470 480 490 

560 570 580 590 600 610 

FIRST. IFIDNQPKDLISVQQGSLGNFSDLHIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDT 

gi|163 



620 630 640 650 660 670 

FIRST_ SYTGATCHNSIYEQSCEVYRHQGNTAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHN 

* * * • * • 

* * * * * * • 

ai | 163 DAAWTWQHG 

1 .500 

680 690 700 710 720 730 

FIRST_ NTELTRVRGANPEKPY-AMALDYGGSMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGT 

•** * * ***** ••*#•••'*•'•* ** ** • • • 

■ « * • * * • • • 

gi 1 163 GPDAVTLRGAPSGHPRSAVSFAYAAGAGQLRSAVNLAERCEQRLALRCGTARRPDSRDGT 

510 520 530 540 550 560 

740 750 760 770 780 790 

FIRST_ PFTWWIGRSNERHPYWGGSPPGVQQCECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFK 

« ** ** ** * ■*•■* * * * I » ■ *■*•**■* *•■***»* ** * 

gi 1 163 PLSWWVGRTNETHTYWGGSLPDAQKCTCGLEGNCIDSQYYCNCDAGRNEWTSDTIVLSQK 

570 580 590 600 610 620 

800 810 820 830 840 850 

FIRST_ DHLPVTQIVITDTDRSNSEAAWRIGPLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSAD 



* * • • • • • 



gi 1 163 EHLPVTQIVMTDTGQPHSEADYTLGPLLCRGDQSFWNSASFNTETSYLHFPAFHGELTAD 

630 640 . 650 660 670 680 

860 870 880 890 900 910 

FIRST. ISFFFKTTALSGVFLENLGIKDFIRLEISSPSEITFAIDVGNGPVELWQSPSLLNDNQW 

»■••• •••• • • • • • •••••• • • • • • 

gi 1 163 VCFFFKTTVSSGVFMENLGITDFIRIELRAPTEVTFSFDVGNGPCEVTVQSPTPFNDNQW 

690 700 710 720 730 740 

920 930 940 950 960 970 

FIRST_ HYVRAERNLKETSLQVDNLPRSTRETSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLH 

gi 1 163 HHV^ERNVKGASLQVDQLPQKMQPAPADGHVRLQLNSQLFIGGTATRQRGFLGCIRSLQ 

750 760 770 780 790 800 

980 990 1000 1010 1020 1030 

FIRST. LNGQKMDLEERAKVTSGVRPGCPGHCSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFC 
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gi 1 163 LNGVALDLEERATVTPGVEPGCAGyCSTYGHLCRNGGRCREKRRGVTCDCAFSAYDGPFC 

810 820 830 840 850 860 

1040 1050 1060 1070 1080 1090 

FIRST_ KKEVSAVFEAGTSVTYMFQEPYPVTKNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLL 



gi 1 163 SNEISAYFATGSSMTYHFQEHYTLSENSSSLVSSLHRDVTLTREMITLSFRTTRTPSLLL 

870 880 890 900 910 920 

1100 1110 1120 1130 1140 1150 

FIRST. FINSSSQDFVVVLLCKNGSLQVRYHLNKEET-HVFTIDADNFANRRMHHLKINREGRELT 



* * * 



gi 1 163 WSSFYEEYLSVILANNGSLQIRYKLDRHQNPDAFTFDFKNMADGQLHQVKIimEEAVVM 

930 940 950 960 970 980 

1160 1170 1180 1190 1200 1210 

FIRST_ I QMDQQLRLS YNF S PEVEFRVI RSLTLGKVTENLGLDS EVAKANAMGFAGCMS S VQ YNH I 

. * . . ....... • • • • ••»•• 

...... . . . . . . . . • . • • « • • • •••••••..«••• 

gi 1 163 VEVNQSTKKQVILSSGTEFNAVKSLILGKVLEAAGADPDTRRAATSGFTGCLSAVRFGRA 

990 1000 1010 1020 1030 1040 

1220 1230 1240 1250 1260 
FIRST APLKAALRHATVAPVTVHGTLTE - S SCGFMVDSDVNA VTTVH S S SD PFGKTDERE PL 



******** ■ * **** 



gi 1 163 APLKAALRPSGPSRVTVRGHVAPMARCAAGAASGSPARELAPRLAGGAGRSGPADEGEPL 

1050 1060 1070 1080 1090 1100 

1270 1280 1290 1300 1310 . 1320 

first tnavrsdsaviggviawifiifciigimtrflyqhk-qshrtsqmkekeypenldss.fr 



gi 1 163 VNADRRDSAVIGGVIAWIFILLCITAIAIRIYQQRKLRKENESKVSKKEEC 

1110 1120 1130 1140 1150 

1330 1340 
FIRST_ NE I DLQNTVS EC KRE YF I 



1345 residues in 1 query sequences 
1154 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:29:32 2002 done: Mon Dec 16 15:29:34 2002 
Scan time: 0.050 Display time: 2.167 

Function used was FASTA

FASTA searches a protein or DNA sequence data bank

version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAV8aisO: 1345 aa
>FIRST_SEQUENCE

vs /tmp/fastaDAAW8aisO library
searching /tmp/fastaDAAW8aisO library

1311 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.066 
The best scores are: opt
gi|18496979|ref|NP_207837.1| cell recognition pro (1311) 5338

>>gi|18496979|ref|NP_207837.1| cell recognition protein (1311 aa)

initn: 2330 initl: 2086 opt: 5338
Smith-Waterman score: 5338; 59.062% identity in 1280 aa overlap (70-1344:33-1310) 

40 50 60 70 80 90 

FIRST KEMDSLPRLTSVLTLLFSGLWHLGLTATNYNCDDPLASLLSPMAFSSSSDLTGTHSP--A 



gi 1 184 LFYLLWLSIDSTKASALTNPNVALFLLADDCDDPLVSALPQASFSSSSELSSSHGPGFA 

10 20 30 40 50 60 

100 110 120 130 140 150 

FIRST. QLNWRVGTGGWSPADSNAQQWLQMDLGNRVEITAVATQGRYGSSDWVTSYSLMFSDTGRN 

.:: : :.::::: :: :::::.: : 

gi 1 184 RLNRRDGAGGWSPLVSNKYQWLQIDLGERMEVTAVATQGGYGSSNWVTSYLLMFSDSGWN 

70 80 90 100 110 120 

160 170 180 190 200 210 

FIRST. WKQYKQEDSIWTFAGNMNADSVWHKLLHSVI^ARFVRFVPLEWNPSGKIGMRVEVYGCSY 



gi 1 184 WKQYRQEDSIWGFSGNANADSVVYYRLQPSIKARFLRFIPLEWNPKGRIGMRIEVFGCAY 

130 140 150 160 170 180 

220 230 240 250 260 270 

FIRST KSDVADFDGRSSLLYRFNQKLMSTLKDVISLKFKSMQGDGVLFHGEGQRGDHITLELQKG 



* • • * * * 



gi 1 184 RSEWDLDGKSSLLYRFDQKSLSPIKDIISLKFKTMQSDGILLHREGPNGDHITLQLRRA 

■ 190 200 210. 220 230 240 

280 290 300 310 320 330 

FIRST RLALHLNLGDSKARLSSSLPSATLGSLLDDQHWH-VLIERVGKQVNFTVDKHTQHFRTKG 



gi 1 184 RLFLLINSGEAKLPSTSTLVNLTLGSLLDDQHWHSVLIQRLGKQVNFTVTDEHRHHFHARG 

250 260 270 280 290 300 

340 350 360 370 380 390 

FIRST ETDALDIDYELSFGGIPVPGKPGTFLKKNFHGCIENLYYNGVNII-LAKRRKHQIYTVGN 



gi 1 184 EFNLMNLDYEISFGGIPAPGKSVSFPHRNFHGCLENLYYNGVDIIDLAKQQKPQIIAMGN 

310 320 330 340 350 360 

400 410 420 430 440 450 

FIRST_ VTFSCSEPQIVPITFNSSGSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSEGSGTL

gi|184 VSFSCSQPQSMPWFLSSRSYLALPDFSGEEEVSATFQFRTWNKAGLLLFSELQLISGGI

370 380 390 400 410 420 

460 470 480 490 500 510 

FIRST_ LLSLEGGILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAPPAPD 



gi 1 184 LLFLSDGKLKSNLYQPGKLPSDITAGVELNDGQWHSVSLSAKKNHLSVAVDGQMASAAPL 

430 440 . 450 460 * 470 480 

520 530 540 550 560 570 

FIRST_ STWVQIYSGNSYYFGGCPDNLTDSQCLNPIKAFQGCMRLIFIDNQPKDLISVQQGSLGNF 



* * 



gi 1 184 LGPEQIYSGGTYYFGGCPDKSFGSKCKSPLGGFQGCMRLISISGKWDLISVQQGSLGNF 

490 500 510 520 530 540 

580 590 600 610 620 630 

FIRST_ SDLHIDLCSIKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRH 

gi 1 184 SDLQIDSCGISDRCLPNYCEHGGECSQSWSTFHCNCTNTGYRGATCHNSIYEQSCEAYKH 

550 560 570 580 590 600 

640 650 660 670 680 690 

FIRST_ QGNTAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAMALD 



gi 1 184 RGNTSGFYYIDSDGSGPLEPFLLYCNMTETA-WTIIQHNGSDLTRVRNTNPENPYAGFFE 

610 620 630 640 650 660 

700 710 720 730 740 750 

FIRST_ YGGSMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWGGSPPG 



gi 1 184 YVASMEQLQATINRAEHCEQEFTYYCKKSRLVNKQDGTPLSWWVGRTNETQTYWGGSSPD 

670 680 690 700 710 720 

760 770 780 790 800 810 

FIRST_ VQQCECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPVTQIVITDTDRSNSEAAW 



gi 1 184 LQKCTCGLEGNCIDSQYYCNCDADRNEWTNDTGLLAYKEHLPVTKIVITDTGRLHSEAAY 

730 740 750 760 770 780 

820 830 840 850 860 870 

FIRST RIGPLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFFKTTALSGVFLENLGIKD 



gi 1 184 KLGPLLCRGDRSFWNSASFDTEASYLHFPTFHGELSADVSFFFKTTASSGVFLENLGIAD 

790 800 810 820 830 840 

880 890 900 910 920 930 

FIRST_ FIRLEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHYVRAERNLKETSLQVDNLPRS 



gi 1 184 F I RI ELRS PTWTF S FDVGNGPFE I SVQS PTHFNDNQWHHVRVERNMKEASLQVDQLTPK 

850 860 870 880 890 900 

940 950 960 970 980 990 

FIRST_ TRETSEEGHFRLQLNSQLFVGGTSSRQKGFLGCIRSLHLNGQKMDLEERAKVTSGVRPGC 



* * 



gi 1 184 TQPAPADGHVLLQLNSQLFVGGTATRQRGFLGCIRSLQLNGMTLDLEERAQVTPEVQPGC 

910 920 930 940 950 960 

1000 1010 1020 1030 1040 1050 

FIRST PGHCSSYGSICHNGGKCVEKHNGYLCDCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPY 
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gi 1 184 RGHCSSYGKLCRNGGKCRERPIGFFCDCTFSAYTGPFCSNEISAYFGSGSSVIYNFQENY 

970 980 990 1000 1010 1020 

1060 1070 1080 1090 1100 1110 

FIRST. PVTKNISLSSSAIYTDSAPSKENIALSFVTTQAPSLLLFINSSSQDFVWLLCKNGSLQV 



gi 1 184 LLSKNSSSHAASFHGDMKLSREMIKFSFRTTRTPSLLLFVSSFYKEYLSVIIAKNGSLQI 

1030 . 1040 1050 .. 1060 1070 . 1080 

1120 1130 1140 1150 1160 1170 

FIRST_ RYHLNK-EETHVFTIDADNFANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVI 



gi 1 184 RYKLNKYQEPDVVNFDFKNMADGQLHHIMINREEGVVFIEIDDNRRRQVHLSSGTEFSAV 

1090 1100 1110 1120 1130 1140 

1180 1190 1200 1210 1220 1230 

FIRST_ RSLTLGKVTENLGLDSEVAKANAMGFAGCMSSVQYl^IAPLKAALRHATVAPVTVHGTLT 



gi 1 184 KSLVLGRILEHSDVDQETALAGAQGFTGCLSAVQLSHVAPLKAALHPSHPDPVTVTGHVT 

1150 1160 1170 1180 1190 1200 

1240 1250 1260 1270 1280 1290 

FIRST_ ESSCGFMVDSDVNAVTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGQVIAVVIFIIFCII 



gi 1 184 ESSCMAQPGTDATSRERTHSFADHSGTIDDREPLANAIKSDSAVIGGLIAWIFILLCIT 

1210 1220 1230 1240 1250 1260 

1300 1310 1320 . 1330 1340 

FIRST_ GIMTRFLYQHKQSHRTSQMKEKEYPENLDSSFRNEIDLQNTVSEGKREYFI 



gi 1 184 AIAVR-IYQQKRLYKRSEAKRSENVDSAEAVLKSELNIQNAVNENQKEYFF 

1270 1280 1290 1300 1310 



1345 residues in 1 query sequences 
1311 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:27:28 2002 done: Mon Dec 16 15:27:30 2002 
Scan time: 0.066 Display time: 2.417 

Function used was FASTA

FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 



/tmp/fastaGAAf3aylx: 1345 aa
>FIRST_SEQUENCE

vs /tmp/fastaHAAg3ayIx library
searching /tmp/fastaHAAg3ayIx library

1477 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16

Scan time: 0.067
The best scores are: opt
gi|14149613|ref|NP_004792.1| neurexin 1 isoform a (1477) 459

>>gi|14149613|ref|NP_004792.1| neurexin 1 isoform alpha (1477 aa)

initn: 261 initl: 142 opt: 459 
Smith-Waterman score: 861; 24.780% identity in 1138 aa overlap (221-1269:283-1345) 

200 210 220 230 240 250 

FIRST_ RFWFVPLEWNPSGKIGMRVEWGCSYKSDVADFDGRSSLLYRFNQIG^MSTLKDVISLKF 

... . ■ .».. 

• • • • • • • • ... ■* ..... 

gi | 141 KDCSQEDNNVEGLAHLMMGDQGKSKGKEEYIATFKGSEYFCYDLSQNPIQSSSDEITLSF 

260 270 280 290 300 310 

260 270 280 290 300 \ 310 

FIRST_ KSMQGDGVLFHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQHW 

. . a . ;. .:...: :..: .. : .::: : : i : • • ■ 

gi 1 141 KTLQRNGLMLH-TGKSADYVNLALKNGAVSLVINLG-SGAFEALVEP VNGKFNDNAW 

320 330 340 350 360 

320 330 340 350 

FIRST_ H-VLIERVGKQ VNFTVDK-HTQHFRTKGETDALDIDYELSFGGIPVPGK-PG 

: : . : . : :...:: : : . . : : . : : : . : : 

gi 1 141 HDVKVTRNLRQHSGIGHAMVTISVT)GILTTTGYTQEDYTMLGSDDFFYVGGSPSTADLPG 

370 380 390 400 410 420 

360 370 380 390 400 , 410 

FIRST_ TFLKKNFHGC I ENLYYNGVNI IL AKRRKHQIYTVGNVTFSCSE-PQIVPITFNSS 

. .. .... •••• 

:.. •* : ::. 

gi 1 141 SPVSNNFMGCLKEWYKNNDVRLELSRLA^ 

430 440 450 460 470 480. 

420 430 440 450 460 

FIRST_ GSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTE LSEGSGTLLLS LE 

: . . : : :.::.:::..::.:-- ....... : - 

gi 1 141 ESFISLPKWNAKKTGSISFDFRTTEPNGLILFSHGKPRHQKDAKHPQMIKVDFFAIEMLD 

490 500 510 520 530 540 

470 480 490 500 510 

FIRST_ GGI-LRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAP-PAPDSTW 



gi 1 141 GHLYLLLDMGSGTIKIKALL--KKVNDGEWYHVDFQRDGRSGTISVNTLRTPYTAPGESE 
550 560 570 580 590 600 

520 530 540 550 560 570 

FIRST_ VQIYSGNSYYFGGCPDN LTDSQCLNPI--KAFQGCMRLIFIDNQPKDLISVQQGSL

gi|141 I-LDLDDELYLGGLPENKAGLVFPTEVWTALLNYGYVGCIRDLFIDGQSKDI--RQMAEV

610 620 630 640 650 660 

580 590 600 610 620 630 

FIRST. GNFSDLHIDLCSIKDR-- CLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSC 



gi 1 141 QSTAGVKPS-CSKETAKPCLSNPCKNNGMCRDGWNRYVCDCSGTGYLGRSC --EREA 

. . 670 680 690 700 . 710 

640 650 660 670 680 690 

FIRST. EVYRHQGNTAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPY 

: . . : . : . : . : : • : ... 

gi 1 141 TVLSYDGSM--FMKIQL PWMHTEAEDVSLRFRSQRAYGILMATTSRDSADTL 

720 730 740 750 760 

700 710 720 730 740 750 

FIRST_ AMALDYGGSMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGRSNERHPYWG 

• • • • • • • • 

: : : . : . . : : .... 

gi 1 141 RLELDAGRV--KLTVNLD C IRINCNSSK GPE-TLFAGYNLNDNEWHTVRV 

770 780 790 800 810 

760 770 780 790 800 

FIRST_ GSPPGVQQCECGLDESCLDIQHFCNCDADKDEWTN-DTGFLSFKDHL PVTQIVITDT 

: . . . : : .:.:.:: : : . : 

gi 1 141 VRRGKSLKLTVD-DQQAMTGQM- -AGDHTRLEFHNIETGI ITERRYLSSVPSNFIGHLQS 

820 830 840 850 860 870 

810 820 830 840 850 

FIRST_ DRSNSEAAW— RIGPL-RCYGDRRF -WNAVSFYTEASYLHFPTFHAEF SAD I SF 



gi 1 141 LTFNGMAYIDLCKNGDIDYCELNARFGFRNIIADPVTFKTKSSYVALATLQAYTSMHLFF 

880 890 900 910 920 930 

860 870 880 890 900 910 

FIRST FFKTTALSGVFLENLGI-KDFIRLEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWHY 



* * m * • * * 



gi 1 141 QFKTTSLDGLILYNSGDGNDFIWELVK-GYTjH^ 

940 950 960 970 980 990 

920 930 940 950 960 
FIRST_ VRAERNLKET-SLQVDNLPRSTRETSEEGHFRLQLNSQLFVGGTSSR QK 



gi 1 141 VMI SRDTSNLHTVKIDT- -KITTQITA-GARNLDLKSDLYIGGVAKETYKSLPKLVHAKE 

1000 1010 1020 1030 1040 

970 980 990 1000 1010 1020 

FIRST. GFLGCIRSLHLNGQKMDLEERAKVTSG-VRPGCPGHCSS-YGSICHNGGKCVEKHNGYLC 



• • * 



* * 



gi 1 141 GFQGCLASVDLNGRLPDLISDALFCNGQIERGCEGPSTTCQEDSCSNQGVCLQQWDGFSC 
1050 1060 1070 1080 1090 1100 

1030 1040 1050 1060 1070 

FIRST DCTNSPYEGPFCKKE-VSAVF-EAGTSVTYMFQEPYPVTKNISLSSSAIYTDSAPSKENI 



gi 1 141 DC SMTS F SG PLCNDPGTT Y I F S KGGGQ I T Y KWP— PNDRPSTRA DRL 

1110 1120 1130 1140 1150 

1080 1090 1100 1110 1120 1130 

FIRST_ ALSFVTTQAPSLLLFINSSSQ--DFVWLLCKNGSLQVRYHLNKEETHVFTIDADNFANR

gi|141 AIGFSTVQKEAVLVRVDSSSGLGDYLELHI-HQGKIGVKFNVGTDDIAIEESNA-IINDG

1160 ' 1170 1180 1190 1200 1210 

1140 1150 1160 1170 1180 1190 

FIRST_ RMHHLKINREGRELTIQMDQQLRLSYNFSPEVE-FRVIRSLTL--GKVTENLGLDSEVAK 

:.:.:.:. :.:..:.::. - . . : . : • • 

gi 1 141 KYHWRFTRSGGNATLQVDSW ; PVI ERYPAGRQLTIFNSQATI I IG - -GK 

1220 1230 . • 1240 1250 

1200 1210 1220 1230 1240 

FIRST. ANAMGF AGCMS SVQYNH I A PLK - AALRHATVAPV TVHGTLTESSCGFMVDSDV 

• > • • • i. ii i . * i i : • . . : : • •»••• 

gi 1 141 EQGQPFQGQLSGLYYNGLKVLNMAAENDANIAIVGNVRLVGEVPSSMTTESTATAMQSEM 
1260 1270 1280 1290 1300 1310 

1250 1260 1270 1280 1290 1300 

FIRST_ NA VTTVHSSSDPFGKTDEREPLTNAVRSDSAVIGGVIAWIFIIFCIIGIMTRFL 

• * * # • • ■ 

gi 1 141 STSIMETTTTLATSTARRGKPPTKEPISQTTDDILVASAECPSDDEDIDPCEPSSGGLAN 
1320 1330 1340 1350 1360 1370 



1345 residues in 1 query sequences 
1477 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:09:13 2002 done: Mon Dec 16 15:09:14 2002 
Scan time: 0.067 Display time: 2.133 

Function used was FASTA 
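
The summary line that FASTA prints for each comparison in these printouts (for example, "Smith-Waterman score: 861; 24.780% identity in 1138 aa overlap (221-1269:283-1345)") packs the score, the percent identity, the overlap length, and the query and library coordinate ranges into a single line. The following is a minimal Python sketch, not part of the original record, of how such a line might be parsed and the implied number of identical positions recovered. The names `FastaSummary`, `parse_summary`, and `identical_residues` and the regular expression are illustrative assumptions modeled only on the line format shown in these printouts, not on any official FASTA output specification.

```python
import re
from typing import NamedTuple, Tuple


class FastaSummary(NamedTuple):
    """Fields of one 'Smith-Waterman score: ...' summary line (hypothetical container)."""
    sw_score: int
    pct_identity: float
    overlap: int
    query_range: Tuple[int, int]
    library_range: Tuple[int, int]


# Pattern modeled on the summary lines in this printout; this is an assumption,
# not an official grammar for FASTA output.
_SUMMARY_RE = re.compile(
    r"Smith-Waterman score:\s*(\d+);\s*"
    r"([\d.]+)% identity in (\d+) aa overlap\s*"
    r"\((\d+)-(\d+):(\d+)-(\d+)\)"
)


def parse_summary(line: str) -> FastaSummary:
    """Extract score, percent identity, overlap length and coordinate ranges."""
    match = _SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("line does not look like a FASTA summary line")
    score, pct, overlap, q_start, q_end, l_start, l_end = match.groups()
    return FastaSummary(
        sw_score=int(score),
        pct_identity=float(pct),
        overlap=int(overlap),
        query_range=(int(q_start), int(q_end)),
        library_range=(int(l_start), int(l_end)),
    )


def identical_residues(summary: FastaSummary) -> int:
    """Approximate count of identical aligned positions implied by the percentage."""
    return round(summary.pct_identity / 100.0 * summary.overlap)


if __name__ == "__main__":
    line = ("Smith-Waterman score: 861; 24.780% identity "
            "in 1138 aa overlap (221-1269:283-1345)")
    summary = parse_summary(line)
    print(summary)
    print(identical_residues(summary))
```

Because the percentage is reported to three decimal places, the rounded count is only an estimate of the number of identical positions, and the overlap length includes gap positions in the alignment.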



http://bioinformatics.lexgen.com/tools/fasta3.php3 



12/16/2002 







FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAPEaigH: 1345 aa
>FIRST_SEQUENCE

vs /tmp/fastaDAAQEaigH library
searching /tmp/fastaDAAQEaigH library

1642 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15: -5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.017 
The best scores are: opt 
gi|21166380|ref|NP_620060.1| neurexin 2, isoform (1642) 408

>>gi|21166380|ref|NP_620060.1| neurexin 2, isoform alpha (1642 aa)

initn: 339 initl: 146 opt: 408 
Smith-Waterman score: 816; 24.250% identity in 1134 aa overlap (221-1279:265-1315) 

200 210 220 230 240 250 

FIRST_ RFVRFVPLEWNPSGKIGMRVEVYGCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLKF 



* * 



gi | 211 GFGGKFCSEEEHPMEGPAHLTLNSEGKEEFVATFKGNEFFCYDLSHNPIQSSTDEITLAF 

240 250 260 270 280 . 290 

260 270 280 290 300. 310 

FIRST_ KSMQGDGVLFHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQHW 

.. ■ ... •«•••• • 
■**... .. ••■•>■ .... . . .... . . » . ••• ■>*■ • 

gi | 211 RTLQRNGLMLH-TGKSADYVNLSLKSGAVWLVINLG-SGAFEALVEP VNGKFNDNAW 

300 310 320 330 340 



320 330 340 350 

FIRST_ H-VLIERVGKQ VNFTVDK-HTQHFRTKGETDALDIDYELSFGGIP-VPGKPG 

.. . . . •• .". ... •• 

• • ...... • ... . • 

gi | 211 HDVRVTRNLRQHAGIGHAMVTISVDGILTTTGYTQEDYTMLGSDDFFYIGGSPNTADLPG 
350 360 370 380 390 400 

360 370 380 390 400 410 

FIRST_ TFLKKNFHGC I ENLYYNGVNI IL AKRRKHQIYTVGNVTFSCSE-PQIVPITFNSS 

. • .. ... •»•• 

gi | 211 S PVSNNFMGCLKD WYKNNDFKLEL S RLAKEGDPKMKLQGDLS FRC EDVAALDPVTFES P 
410 420 430 . 440 450 460 

420 430 440 450 460 
FIRST_ GSYLLLPGTPQIDGLSVSFQFRTWNKDGLLLSTELSE GSGT LLLSLEG 

* * *• . . I 

gi | 211 EAFVALPRWSAKRTGSISLDFRTTEPNGLLLFSQGRRAGGGAGSHSSAQRADYFAMELLD 
470 480 490 500 510 520 

470 480 490 500 510 

FIRST_ GILRLVIQKMTERVAEILTGSNLNDGLWHSVSINARRNRITLTLDDEAAP--PAPDSTWV 

• ••• • * 

gi | 211 GHLYLLLDMGSGGIKLRASSRKVNDGEWCHVDFQRDGRKGSISVNSRSTPFLATGDSEIL 
530 540 550 560 570 580 

520 530 540 550 560 570 

FIRST_ QIYSGNSYYFGGCPDNLTDSQCLNP 1 KA- FQGCMRL I F I DNQPKDL - - 1 S VQQG

gi|211 DLES--ELYLGGLPEGGRVDLPLPPEVWTAALRAGYVGCVRDLFIDGRSRDLRGLAEAQG
590 600 610 620 630 640 

580 590 600 610 620 

FIRST_ SLGNFSDLHIDLCSIKD--RCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQ 

* ■ * * ** * 

« « • *•** * * **•••• 

gi | 211 AVGV APFCSRETLKQCASAPCRNGGVCREGWNRFICDCIGTGFLGRVC ER 

650 660 670 680 690 

630 640 650 660 670 680 

FIRST_ SCEVYRHQGNTAGFFYIDSDGSGPLGPLQWCNITEDKIWTSVQI4NNTELTRVRGANPEK 

. . . > * 

. . . . . . * . •» . . . . . . « . . . 

gi | 211 EATVLS YDGSM- - YMKI MLPNAMHTEAEDVSLRFMSQRAYGLMMATTSRESAD 

700 710 720 730 740 

690 700 7.10 720 730 
FIRST_ P YAMALDYGGSMEQLEAVI DGS E HCEQEVAYHC RRSRLLN-TPDGTPFTWW 

**••* • • • 

******** » * 4 • * * * • 

gi | 211 TLRLELD-GGQMKLTVNLGKGPETLFAGHKLNDl^EWHTVRVVRRGKSLQLSVDN^ 
750 760 770 780 790 800 

740 750 760 770 780 790 

FIRST__ IGRSNERHPYWGGSPPGVQQCECGLDESCLDIQHFCNCDADKDEWTNDTGFLSFKDHLPV 

• ■ * * • • • 

gi | 211 MAGAHMRLEF HNIETGI MTERRFISV-VPSNFIGHLSG-LVFNGQPYM 

810 820 830 840 850 

800 . 810 820 830 840 850 

FIRST_ TQIVITDTDRSNSEAAWRIGPLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADISFFF 

. ■> ■ >••• ■ * • • 

. . . . . * . . . . .«••••••••••••• • . . • 

gi | 211 DQC--KDGDITYCELNARFG-LRAI VADPVTFKSRSSYLALATLQAYASMHLFFQF 

860 870 880 890 900 

860 870 880 890 900 910 

FIRST_ KTTAL S GVFLENLG I - KDF I RLE I S S P S E I TFAI DVGNG PVELWQ S PS LLNDNQWH YVR 

• . • • • • • • • • • • • • • • • • - • •«»••• 

• • • • >■•>• * • ■ • • ••••••••• • • ••••••• • 

gi | 211 KTTAPDGLLLFNSGNGNDFIVIELVK-GYIHWFDLGNGPSLMKGNSDKPVNDNQWHNVV 
910 920 930 940 950 960 

920 930 940 950 960 
FIRST_ AERNLKET-SLQVDNLPRSTRETSEEGHFRLQLNSQLFVGGTS SRQKGF 

■ a ■ •••••> ■•• • •> • ■■*••*■*■■• • > ■ 

gi | 211 VSRDPGNVHTLKIDS--RTVTQHSN-GARNLDLKGELYIGGLSKNMFSNLPKLVASRDGF 

970 980 990 1000 1010 1020. 

970 980 990 1000 1010 1020 

FIRST_ LGCIRSLHLNGQKMDLEERAKVTSG-VRPGCPGH CSSYGSICHNGGKCVEKHNGYLC 

• • •» • * • 

* . ■ ...... • «•••••••••• 

gi | 211 QGCLASVDLNGRLPDLIADALHRIGQVERGCDGPSTTCTEES--CANQGVCLQQWDGFTC 

1030 1040 1050 1060 1070 1080 

1030 1040 1050 1060 1070 

FIRST_ DCTNSPYEGPFCKKEVSAVFEAGTSVTYMFQEPYPVTKNISLSSSAIYTDSAPSK — ENI 

.... ... ..2! .. 

*•■•■•••* •■• •••• • ••••• •••• 

gi | 211 DCTMTSYGGPVCN DPGT — TYIFG KGGALITYTWPPNDRPSTRMDRL 

1090 1100 1110 1120 

1080 1090 1100 1110 1120 1130 

FIRST_ ALSFVTTQAPSLLLFINSSS--QDFVWLLCKNGSLQVRYHLNKEETHVFTIDADN--FA

gi|211 AVGFSTHQRSAVLVRVDSASGLGDYLQLHI-DQGTVGVIFWGTDDITIDEPNAIVS
1130 1140 1150 1160 1170 1180

1140 1150 1160 1170 1180 1190

FIRST_ NRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLTLGKVTENLGLDSEVAKA

gi|211 DGKYHWRFTRSGGNATLQVDSW-PVNERYPAGRQLTIFNSQAAIKIG GRDQ

1190 1200 1210 1220 1230

1200 1210 1220 1230 1240 1250

FIRST_ NAMGFAGCMSSVQYNHIAPLKAALRHATVAPVTVHGTLTE

gi|211 -GRPFQGQVSGLYYNGLKVLALAAESDPNWTEGHLRLVGEGPSVLLSAETTATTLLADM
1240 1250 1260 1270 1280 1290

1260 1270 1280 1290 1300 1310

FIRST_ SDPFGKTDEREPLTNAVRSDSAVIGGVIAWIFIIFCIIGIMTRFLYQHKQSHRTSQMKE

gi|211 ATTIMETTTTMATTTTRRGRSPTLRDSTTQNTDDLLVASAECPSDDEDLEECEPSTGGEL
1300 1310 1320 1330 1340 1350



1345 residues in 1 query sequences 
1642 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:05:13 2002 done: Mon Dec 16 15:05:15 2002 
Scan time: 0.017 Display time: 2.166

Function used was FASTA

FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAA4caW3G: 1345 aa
>FIRST_SEQUENCE

vs /tmp/fastaDAA5caW3G library
searching /tmp/fastaDAA5caW3G library



1392 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2 

join: 40, opt: 28, gap-pen: -12/ -2, width: 16 

Scan time: 0.033 
The best scores are: opt 
gi|23498650|emb|CAC87720.2| neurexin 3-alpha [Hom (1392) 369

>>gi|23498650|emb|CAC87720.2| neurexin 3-alpha [Homo sap (1392 aa)

initn: 324 initl: 139 opt: 369 
Smith-Waterman score: 774; 25.791% identity in 1012 aa overlap (220-1156:255-1184) 

190 200 210 220 230 240 

FIRST_ ARFVRFVPLEWNPSGKIGMRVEVYGCSYKSDVADFDGRSSLLYRFNQKLMSTLKDVISLK 

• « ... 
• • . • • • ..:» ... •» .... 
... . ■ • • .... 

gi I 234 STTGYGGKLCSEGLSHLMMSEQGRSKAREENVATFRGSEYLCYDLSQNPIQSSSDEITLS 

230 240 250 260 270 280 

250 260 270 280 290 300 

FIRST_ FKSMQGDGVLFHGEGQRGDHITLELQKGRLALHLNLGDSKARLSSSLPSATLGSLLDDQH 



gi I 234 FKTWQRNGLILH-TGKSADYVNLALKDGAVSLVINLG-SGAFEAIVEP-— VNGKFNDNA 

290 300 310 320 330 

310 320 330 340 350 360 

FIRST_ WH-VLIERVGKQVNFTVDK-HTQHFRTKGETDALDIDYELSFGGIPVPGK-PGTFLKKNF 



gi I 234 WHDVKVTRNLRQVTI SVDGILTTTGYTQEDYTMLGSDDFF YVGGS PSTADLPGS PVSNNF 
340 350 360 370 380 390 

370 380 390 400 410 

FIRST. HGC I ENL YYNGVNI I LAKRR KHQIYTVGNVTFSCSE-PQIVPITFNSSGSYLL 

:...:: : : . : : :.:.:.:. .::.:.. - : - 

gi I 234 MGCLKE\A^K]^iRLELSRI^IADTKMKIY--GF^FKCENVATLDPINFETPEAYIS 
400 410 420 430 . 440 450 

420 430 440 450 460 

FIRST_ LPGTPQIDGLSVS FQFRTWNKDGLLLSTE LSEGSGTLLLSLE- -GGILRL 



... ..«• ■■..'*....• • * 



gi I 234 LPKWNTKRMGSISFDFRTTEPNGLILFTHGKPQERKDARSQKNTKVDFFAVELLDGNLYL 
460 470 480 490 500 510 

470 480 490 500 510 520 

FIRST_ VI QKMTERVAE I LTGSNLNDGLWHSVS INARRNRITLTLDDEAAP- PAPDSTWVQI YSGN 



gi I 234 LLDMGSGTIKVKATQKKANDGEWYHVDIQRDGRSGTISVNSRRTPFTASGESEILDLEGD 
520 530 540 550 560 570 

530 540 550 560 570 

FIRST_ SYYFGGCPDN LTDSQCLNPI--KAFQGCMRLIFIDNQPKDLISVQQGSLGNFSD

gi|234 -M^GLPENRAGLILPTELWTAMLNYGYVGCIRDLFIDGRSKNIRQLAEMQNAAGVKSS
580 590 600 610 620 630 

580 590 600 610 620 630 

FIRST LHIDLCS— IKDRCLPNYCEHGGSCSQSWTTFYCNCSDTSYTGATCHNSIYEQSCEVYRH 

: : : . : : : 

• * * *■*■* • •••• 

oi I 234 -CSRMSAKQCDSYPCKNNAVCKDGWNRFICDCTGTGYWGRTCER EASILSY 

1 -•• 6 40 650 660 670 680 

640 650 660 670 680 690 

FIRST_ QGNTAGFFYIDSDGSGPLGPLQVYCNITEDKIWTSVQHNNTELTRVRGANPEKPYAM- -A 

... . : . : : :.:...:. : : 

• |„, DCS MYMKI IMPMVMHTEAEDVSFRFMS-QRAYGLLVA 

91,234 690 700 710 720 

700 710 720 730 740 

FIRST_ LDYGGSMEQLEAVIDGSEHCEQEVAYHCRRSRLLNTPDGTPFTWWIGR— SNERHPY— 

ai I 234 TTSRDSADTLRLELDGG RVPXMVNLGKG- PETLYAGQKLNDNEWHTVRV 

1 730 740 750 760 770 

750 760 770 780 790 

FIRST. — WGGS PPGVQQCECGLDESCLDIQHFCNCDADKDEWTN- -DTGFLSFKDHLPVT 

; . . ; .... 

ai I 234 VRRGKSLKLTVDDDVAEGTMVGDHTRLEFHNI ETGIMTEKRYI S WPS SF I GHLQSLMFN 
1 78O 790 800 810 820 830 

800 810 , 820 830 840 850 

FIRST_ QIVITDT— -DRSNSEAAWRIGPLRCYGDRRFWNAVSFYTEASYLHFPTFHAEFSADIS 

oi I 234 GLLYIDLCKNGDIDYCEIiKARFG-LiRNI IADPVTFKTKSSYLSLATLQAYTSMHLF 
y 1 8 40 850 860 870 880 

860 870 880 890 900 910 

FIRST_ FFFKTTALSGVFLENLGI-KDFIRLEISSPSEITFAIDVGNGPVELWQSPSLLNDNQWH 



• • • • 



ai I 234 FOFKTTSPDGFILFNSGDGNDFIAVELVK-GYIHYVFDLGNGPNVIKGNSDRPLNDNQWH 
' 890 900 910 920 930 940 

920 930 940 950 960 

FIRST. YVRAERNLKET- SLQVDNLPRSTRETS E - - EGHFRLQLNSQLFVGGTS S 

• * 
ai I 234 NWITRDNSNTHSLKVD TKWTQVINGAKNLDLKGDLYMAGLAQGMYSNLPKLVA 

950 960 970 980 990 

970 980 990 1000 1010 

FIRST. RQKGFLGC IRSLHLNGQKMDLEERAKVTSG - VRPGC PGHC S S - YGS ICHNGGKCVEKHNG 

ai I 234 SRDGFQGC^SW^GRLPDLI^ALHRSGQIERGCEGPSTTCQEDSCANQGVCMQQWEG 
1000 1010 1020 1030 1040 1050 

1020 1030 1040 1050 1060 1070 

FIRST. YLCDCTNS PYEGPFCKKE - VS AVF - EAGTSVTYMFQEPYPVTKNI SLS S S AI YTDS APSK 



:. 

ai I 234 FTCDCSMTSYSGNQCNDPGATYIFGKSGGLILYT WPANDRPSTRS 

1060 1070 1080 1090 1100 

1080 1090 1100 UIO 1120 1130 

FIRST. ENIALSFVTTQAPSLLLFINSSSQ- - DFVWLLC KNG S LQVRYHLNKE ETHVFT I DADNF 



gi|234 DRIAVGFSTTVKDGILWIDSAPGI^DFLQLHI-EQGKIGWFNIGTVDISIKE-ERTPV
1110 1120 1130 1140 1150 1160

1140 1150 1160 1170 1180 1190 

FIRST. ANRRMHHLKINREGRELTIQMDQQLRLSYNFSPEVEFRVIRSLTLGKVTENLGLDSEVAK 



ai I 234 NDGKYlWvRFTRNGGNATLQVDNWPVNEHY^ 

1 .. 1170 1180 1190. 1200 1210 1220 



1345 residues in 1 query sequences 
1392 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Dec 16 15:07:25 2002 done: Mon Dec 16 15:07:26 2002 
Scan time: 0.033 Display time: 1.817 



Function used was FASTA

characterize the protein. A starting material that can only be used to produce
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were
sequenced and open reading frames were identified. The specification
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
Ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes.

Note that if there is a well-established utility already associated with the
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a well- 
established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 
101 rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should
not be made. 

Example 11: Animals with Uncharacterized Human Genes

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence nor presence of a specific protein has been identified with
the disease condition.

ABSTRACT



A method and apparatus for preparation of a substrate 
containing a plurality of sequences. Photoremovable 
groups are attached to a surface of a substrate. Selected 
regions of the substrate are exposed to light so as to 
activate the selected areas. A monomer, also containing 
a photoremovable group, is provided to the substrate to 
bind at the selected areas. The process is repeated using 
a variety of monomers such as amino acids until sequen- 
ces of a desired length are obtained. Detection methods 
and apparatus are also disclosed. 
ARRAY OF OLIGONUCLEOTIDES ON A SOLID SUBSTRATE 

CROSS-REFERENCE TO RELATED APPLICATIONS 

This application is a Rule 60 Division of U.S. application Ser. No. 850,356, 
filed Mar. 12, 1992, which is a Rule 60 Division of U.S. application Ser. No. 
492,462, filed Mar. 7, 1990, now U.S. Pat. No. 5,143,854, which is a 
Continuation-in-Part of U.S. application Ser. No. 362,901, filed Jun. 7, 1989, 
now abandoned, all assigned to the assignee of the present invention. 

The file of this patent contains drawings executed in color. Copies of this 
patent with color drawings will be provided by the Patent and Trademark Office 
upon request and payment of the necessary fee. 

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is 
subject to copyright protection. The copyright owner has no objection to the 
facsimile reproduction by anyone of the patent document or the patent 
disclosure as it appears in the Patent and Trademark Office patent file or 
records, but otherwise reserves all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 

The present inventions relate to the synthesis and placement of materials at 
known locations. In particular, one embodiment of the inventions provides a 
method and associated apparatus for preparing diverse chemical sequences at 
known locations on a single substrate surface. The inventions may be applied, 
for example, in the field of preparation of oligomer, peptide, nucleic acid, 
oligosaccharide, phospholipid, polymer, or drug congener preparation, 
especially to create sources of chemical diversity for use in screening for 
biological activity. 

The relationship between structure and activity of molecules is a fundamental 
issue in the study of biological systems. Structure-activity relationships are 
important in understanding, for example, the function of enzymes, the ways in 
which cells communicate with each other, as well as cellular control and 
feedback systems. 

Certain macromolecules are known to interact and bind to other molecules 
having a very specific three-dimensional spatial and electronic distribution. 
Any large molecule having such specificity can be considered a receptor, 
whether it is an enzyme catalyzing hydrolysis of a metabolic intermediate, a 
cell-surface protein mediating membrane transport of ions, a glycoprotein 
serving to identify a particular cell to its neighbors, an IgG-class antibody 
circulating in the plasma, an oligonucleotide sequence of DNA in the nucleus, 
or the like. The various molecules which receptors selectively bind are known 
as ligands. 

Many assays are available for measuring the binding affinity of known 
receptors and ligands, but the information which can be gained from such 
experiments is often limited by the number and type of ligands which are 
available. Novel ligands are sometimes discovered by chance or by application 
of new techniques for the elucidation of molecular structure, including x-ray 
crystallographic analysis and recombinant genetic techniques for proteins. 

Small peptides are an exemplary system for exploring the relationship between 
structure and function in biology. A peptide is a sequence of amino acids. 
When the twenty naturally occurring amino acids are condensed into polymeric 
molecules they form a wide variety of three-dimensional configurations, each 
resulting from a particular amino acid sequence and solvent condition. The 
number of possible pentapeptides of the 20 naturally occurring amino acids, 
for example, is 20^5 or 3.2 million different peptides. The likelihood that 
molecules of this size might be useful in receptor-binding studies is 
supported by epitope analysis studies showing that some antibodies recognize 
sequences as short as a few amino acids with high specificity. Furthermore, 
the average molecular weight of amino acids puts small peptides in the size 
range of many currently useful pharmaceutical products. 

Pharmaceutical drug discovery is one type of research which relies on such a 
study of structure-activity relationships. In most cases, contemporary 
pharmaceutical research can be described as the process of discovering novel 
ligands with desirable patterns of specificity for biologically important 
receptors. Another example is research to discover new compounds for use in 
agriculture, such as pesticides and herbicides. 

Sometimes, the solution to a rational process of designing ligands is 
difficult or unyielding. Prior methods of preparing large numbers of different 
polymers have been painstakingly slow when used at a scale sufficient to 
permit effective rational or random screening. For example, the "Merrifield" 
method (J. Am. Chem. Soc. 85:2149-2154, which is incorporated herein by 
reference for all purposes) has been used to synthesize peptides on a solid 
support. In the Merrifield method, an amino acid is covalently bonded to a 
support made of an insoluble polymer. Another amino acid with an alpha 
protected group is reacted with the covalently bonded amino acid to form a 
dipeptide. After washing, the protective group is removed and a third amino 
acid with an alpha protective group is added to the dipeptide. This process is 
continued until a peptide of a desired length and sequence is obtained. Using 
the Merrifield method, it is not economically practical to synthesize more 
than a handful of peptide sequences in a day. 

To synthesize larger numbers of polymer sequences, it has also been proposed 
to use a series of reaction vessels for polymer synthesis. For example, a 
tubular reactor system may be used to synthesize a linear polymer on a solid 
phase support by automated sequential addition of reagents. This method still 
does not enable the synthesis of a sufficiently large number of polymer 
sequences for effective economical screening. 

Methods of preparing a plurality of polymer sequences are also known in which 
a porous container encloses a known quantity of reactive particles, the 
particles being larger in size than pores of the container. The containers may 
be selectively reacted with desired materials to synthesize desired sequences 
of product molecules. As with other methods known in the art, this method 
cannot practically be used to synthesize a sufficient variety of polypeptides 
for effective screening. 

Other techniques have also been described. These methods include the synthesis 
of peptides on 96 plastic pins which fit the format of standard microtiter 
plates. Unfortunately, while these techniques have been somewhat useful, 
substantial problems remain. For example, these methods continue to be limited 
in the diversity of sequences which can be economically synthesized and 
screened. 
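
The pentapeptide count cited in the Background follows from raising the number of available monomers to the power of the sequence length. A minimal Python check is given below; the function name `count_sequences` is an assumption made for this sketch.

```python
def count_sequences(alphabet_size: int, length: int) -> int:
    """Number of distinct linear sequences of a given length."""
    if alphabet_size < 1 or length < 0:
        raise ValueError("alphabet_size must be >= 1 and length >= 0")
    return alphabet_size ** length


# 20 naturally occurring amino acids, sequences five monomers long.
pentapeptides = count_sequences(20, 5)
print(f"{pentapeptides:,}")  # 3,200,000
```

The same function shows how quickly the sequence space grows with length, which is why methods yielding only a handful of sequences per day cannot cover it.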
From the above, it is seen that an improved method and apparatus for 
synthesizing a variety of chemical sequences at known locations is desired. 

SUMMARY OF THE INVENTION 

An improved method and apparatus for the preparation of a variety of polymers 
is disclosed. 

In one preferred embodiment, linker molecules are provided on a substrate. A 
terminal end of the linker molecules is provided with a reactive functional 
group protected with a photoremovable protective group. Using lithographic 
methods, the photoremovable protective group is exposed to light and removed 
from the linker molecules in first selected regions. The substrate is then 
washed or otherwise contacted with a first monomer that reacts with exposed 
functional groups on the linker molecules. In a preferred embodiment, the 
monomer is an amino acid containing a photoremovable protective group at its 
amino or carboxy terminus and the linker molecule terminates in an amino or 
carboxy acid group bearing a photoremovable protective group. 

A second set of selected regions is, thereafter, exposed to light and the 
photoremovable protective group on the linker molecule/protected amino acid is 
removed at the second set of regions. The substrate is then contacted with a 
second monomer containing a photoremovable protective group for reaction with 
exposed functional groups. This process is repeated to selectively apply 
monomers until polymers of a desired length and desired chemical sequence are 
obtained. Photolabile groups are then optionally removed and the sequence is, 
thereafter, optionally capped. Side chain protective groups, if present, are 
also removed. 

By using the lithographic techniques disclosed herein, it is possible to 
direct light to relatively small and precisely known locations on the 
substrate. It is, therefore, possible to synthesize polymers of a known 
chemical sequence at known locations on the substrate. 

The resulting substrate will have a variety of uses including, for example, 
screening large numbers of polymers for biological activity. To screen for 
biological activity, the substrate is exposed to one or more receptors such as 
antibodies, whole cells, receptors on vesicles, lipids, or any of a variety of 
other receptors. The receptors are preferably labeled with, for example, a 
fluorescent marker, radioactive marker, or a labeled [illegible]. Through 
knowledge of the sequence of the material at the location where binding is 
detected, it is possible to quickly determine which sequence binds with the 
receptor and, therefore, the technique can be used to screen large numbers of 
peptides. Other possible applications of the inventions herein include 
diagnostics in which various antibodies for particular receptors would be 
placed on a substrate and, for example, blood sera would be screened for 
immune deficiencies. Still further applications include, for example, 
selective "doping" of organic materials in semiconductor devices, and the 
like. 

In connection with one aspect of the invention an improved reactor system for 
synthesizing polymers is also disclosed. The reactor system includes a 
substrate mount which engages a substrate around a periphery thereof. The 
substrate mount provides for a reactor space between the substrate and the 
mount through or into which reaction fluids are pumped or flowed. A mask is 
placed on or focused on the substrate and illuminated so as to deprotect 
selected regions of the substrate in the reactor space. A monomer is pumped 
through the reactor space or otherwise contacted with the substrate and reacts 
with the deprotected regions. By selectively deprotecting regions on the 
substrate and flowing predetermined monomers through the reactor space, 
desired polymers at known locations may be synthesized. 

Improved detection apparatus and methods are also disclosed. The detection 
method and apparatus utilize a substrate having a large variety of polymer 
sequences at known locations on a surface thereof. The substrate is exposed to 
a fluorescently labeled receptor which binds to one or more of the polymer 
sequences. The substrate is placed in a microscope detection apparatus for 
identification of locations where binding takes place. The microscope 
detection apparatus includes a monochromatic or polychromatic light source for 
directing light at the substrate, means for detecting fluoresced light from 
the substrate, and means for determining a location of the fluoresced light. 
The means for detecting light fluoresced on the substrate may in some 
embodiments include a photon counter. The means for determining a location of 
the fluoresced light may include an x/y translation table for the substrate. 
Translation of the slide and data collection are recorded and managed by an 
appropriately programmed digital computer. 

A further understanding of the nature and advantages of the inventions herein 
may be realized by reference to the remaining portions of the specification 
and the attached drawings. 

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 illustrates masking and irradiation of a substrate at a first location. 
The substrate is shown in cross-section; 

FIG. 2 illustrates the substrate after application of a monomer "A"; 

FIG. 3 illustrates irradiation of the substrate at a second location; 

FIG. 4 illustrates the substrate after application of monomer "B"; 

FIG. 5 illustrates irradiation of the "A" monomer; 

FIG. 6 illustrates the substrate after a second application of "B"; 

FIG. 7 [illegible]; 

FIGS. 8A and 8B [illegible] on a substrate; 

FIG. 9 illustrates a detection apparatus for locating fluorescent markers on 
the substrate; 

FIGS. 10A-10M illustrate the method as it is applied to the production of the 
trimers of monomers "A" and "B"; 

FIGS. 11A and 11B are fluorescence traces for standard fluorescent beads; 

FIGS. 12A and 12B are fluorescence curves for NVOC 
(6-nitroveratryloxycarbonyl) slides not exposed and exposed to light 
respectively; 

FIGS. 13A to 13D are fluorescence plots of slides exposed through 100 µm, 
50 µm, 20 µm and 10 µm masks; 

FIGS. 14A and 14B illustrate formation of YGGFL (a peptide of sequence 
H2N-tyrosine-glycine-glycine-phenylalanine-leucine-CO2H) and GGFL (a peptide 
of sequence H2N-glycine-glycine-phenylalanine-leucine-CO2H), followed by 
exposure to labeled Herz antibody (an antibody that recognizes YGGFL but not 
GGFL); 
FIGS. 15A and 15B are fluorescence plots of a slide with 
a checkerboard pattern of YGGFL and GGFL exposed 
to labeled Herz antibody. FIG. 15A illustrates a 
500 x 500 µm mask which has been focused on the 
substrate according to FIG. 8A while FIG. 15B illustrates 
a 50 x 50 µm mask placed in direct contact with the 
substrate in accord with FIG. 8B; 

FIG. 16 is a fluorescence plot of YGGFL and 
PGGFL synthesized in a 50 µm checkerboard pattern; 

FIG. 17 is a fluorescence plot of YPGGFL and 
YGGFL synthesized in a 50 µm checkerboard pattern; 

FIGS. 18A and 18B illustrate the mapping of sixteen 
sequences synthesized on two different glass slides; 

FIG. 19 is a fluorescence plot of the slide illustrated 
in FIG. 18A; and 

FIG. 20 is a fluorescence plot of the slide illustrated 
in FIG. 18B. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 
I. Glossary 

The following terms are intended to have the following 
general meanings as they are used herein: 

1. Complementary: Refers to the topological compati- 
bility or matching together of interacting surfaces of 
a ligand molecule and its receptor. Thus, the receptor 
and its ligand can be described as complementary, 
and furthermore, the contact surface characteristics 
are complementary to each other. 

2. Epitope: The portion of an antigen molecule which is 
delineated by the area of interaction with the subclass 
of receptors known as antibodies. 

3. Ligand: A ligand is a molecule that is recognized by 
a particular receptor. Examples of ligands that can be 
investigated by this invention include, but are not 
restricted to, agonists and antagonists for cell membrane 
receptors, toxins and venoms, viral epitopes, 
hormones (e.g., steroids, etc.), hormone receptors, 
peptides, enzymes, enzyme substrates, cofactors, 
drugs (e.g., opiates, etc.), lectins, sugars, oligonucleotides, 
nucleic acids, oligosaccharides, proteins, and 
monoclonal antibodies. 

4. Monomer: A member of the set of small molecules 
which can be joined together to form a polymer. The 
set of monomers includes but is not restricted to, for 
example, the set of common L-amino acids, the set of 
D-amino acids, the set of synthetic amino acids, the 
set of nucleotides and the set of pentoses and hexoses. 
As used herein, monomers refers to any member of a 
basis set for synthesis of a polymer. For example, 
dimers of L-amino acids form a basis set of 400 mono- 
mers for synthesis of polypeptides. Different basis 
sets of monomers may be used at successive steps in 
the synthesis of a polymer. 
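
The basis-set example in the Monomer definition can be checked with a short Python sketch. The one-letter codes are the standard abbreviations for the 20 common amino acids; the variable names are assumptions made for this illustration.

```python
from itertools import product

# Standard one-letter codes for the 20 common L-amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered pair of amino acids is one "monomer" of the dimer basis set.
dimer_basis = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

print(len(AMINO_ACIDS))   # size of the single-residue basis set
print(len(dimer_basis))   # size of the dimer basis set
print(dimer_basis[:3])    # first few dimers in enumeration order
```

Two coupling steps with the dimer basis set reach the same 160,000 tetrapeptides as four coupling steps with single amino acids, since 400^2 = 20^4.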

5. Peptide: A polymer in which the monomers are alpha 
amino acids joined together through amide bonds, 
alternatively referred to as a polypeptide. 
In the context of this specification it should 
be appreciated that the amino acids may be the 
L-optical isomer or the D-optical isomer. Peptides are 
more than two amino acid monomers long, and often 
more than 20 amino acid monomers long. Standard 
abbreviations for amino acids are used (e.g., P for 
proline). These abbreviations are included in Stryer, 
Biochemistry, Third Ed., 1988, which is incorporated 
herein by reference for all purposes. 
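
As a hedged illustration of the standard abbreviations mentioned in the Peptide definition, the following Python sketch expands a one-letter sequence such as YGGFL into the written-out form used in the figure descriptions. The function name `expand_peptide` and the output format are assumptions made for this sketch.

```python
# Standard one-letter abbreviations for the 20 common amino acids.
ONE_LETTER = {
    "A": "alanine", "C": "cysteine", "D": "aspartic acid",
    "E": "glutamic acid", "F": "phenylalanine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "K": "lysine", "L": "leucine",
    "M": "methionine", "N": "asparagine", "P": "proline",
    "Q": "glutamine", "R": "arginine", "S": "serine", "T": "threonine",
    "V": "valine", "W": "tryptophan", "Y": "tyrosine",
}


def expand_peptide(sequence: str) -> str:
    """Write a one-letter peptide sequence out from amino to carboxy terminus."""
    try:
        residues = [ONE_LETTER[code] for code in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unknown amino acid code: {err.args[0]}") from None
    return "-".join(["H2N", *residues, "CO2H"])


print(expand_peptide("YGGFL"))
print(expand_peptide("GGFL"))
```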

6. Radiation: Energy which may be selectively applied 
including energy having a wavelength of between 
10^-14 and 10^4 meters including, for example, electron 
beam radiation, gamma radiation, x-ray radiation, 
ultraviolet radiation, visible light, infrared radiation, 
microwave radiation, and radio waves. "Irradiation" 
refers to the application of radiation to a surface. 

7. Receptor: A molecule that has an affinity for a given 
ligand. Receptors may be naturally-occurring or man- 
made molecules. Also, they can be employed in their 
unaltered state or as aggregates with other species. 
Receptors may be attached, covalently or 
noncovalently, to a binding member, either directly or via a 
specific binding substance. Examples of receptors 
which can be employed by this invention include, but 
are not restricted to, antibodies, cell membrane recep- 
tors, monoclonal antibodies and antisera reactive 
with specific antigenic determinants (such as on vi- 
ruses, cells or other materials), drugs, polynucleo- 
tides, nucleic acids, peptides, cofactors, lectins, sug- 
ars, polysaccharides, cells, cellular membranes, and 
organelles. Receptors are sometimes referred to in the 
art as anti-ligands. As the term receptors is used 
herein, no difference in meaning is intended. A "Li- 
gand Receptor Pair" is formed when two macromol- 
ecules have combined through molecular recognition 
to form a complex. 

Other examples of receptors which can be investi- 
gated by this invention include but are not restricted to: 

a) Microorganism receptors: Determination of li- 
gands which bind to receptors, such as specific 
transport proteins or enzymes essential to survival 
of microorganisms, is useful in a new class of antibi- 
otics. Of particular value would be antibiotics 
against opportunistic fungi, protozoa, and those 
bacteria resistant to the antibiotics in current use. 



5,445,934 

b) Enzymes: For instance, the binding site of enzymes such as the enzymes 
responsible for cleaving neurotransmitters; determination of ligands which 
bind to certain receptors to modulate the action of the enzymes which cleave 
the different neurotransmitters is useful in the development of drugs which 
can be used in the treatment of disorders of neurotransmission. 

c) Antibodies: For instance, the invention may be useful in investigating the 
ligand-binding site on the antibody molecule which combines with the epitope 
of an antigen of interest; determining a sequence that mimics an antigenic 
epitope may lead to the development of vaccines of which the immunogen is 
based on one or more of such sequences or lead to the development of related 
diagnostic agents or compounds useful in therapeutic treatments such as for 
auto immune diseases (e.g., by blocking the binding of the "self" antibodies). 

d) Nucleic Acids: Sequences of nucleic acids may be synthesized to establish 
DNA or RNA binding sequences. 

e) Catalytic Polypeptides: Polymers, preferably polypeptides, which are 
capable of promoting a chemical reaction involving the conversion of one or 
more reactants to one or more products. Such polypeptides generally include a 
binding site specific for at least one reactant or reaction intermediate and 
an active functionality proximate to the binding site, which functionality is 
capable of chemically modifying the bound reactant. Catalytic polypeptides are 
described in, for example, U.S. application Ser. No. 404,920, which is 
incorporated herein by reference for all purposes. 

f) Hormone receptors: For instance, the receptors for insulin and growth 
hormone. Determination of the ligands which bind with high affinity to a 
receptor is useful in the development of, for example, an oral replacement of 
the daily injections which diabetics must take to relieve the symptoms of 
diabetes, and in the other case, a replacement for the scarce human growth 
hormone which can only be obtained from cadavers or by recombinant DNA 
technology. Other examples are the vasoconstrictive hormone receptors; 
determination of those ligands which bind to a receptor may lead to the 
development of drugs to control blood pressure. 

g) Opiate receptors: Determination of ligands which bind to the opiate 
receptors in the brain is useful in the development of less-addictive 
replacements for morphine and related drugs. 

8. Substrate: A material having a rigid or semi-rigid surface. In many 
embodiments, at least one surface of the substrate will be substantially flat, 
although in some embodiments it may be desirable to physically separate 
synthesis regions for different polymers with, for example, wells, raised 
regions, etched trenches, or the like. According to other embodiments, beads 
may be provided on the surface which may be released upon completion of the 
synthesis. 

9. Protective Group: A material which is bound to a monomer unit and which may 
be spatially removed upon selective exposure to an activator such as 
electromagnetic radiation. Examples of protective groups with utility herein 
include Nitroveratryloxy carbonyl, Nitrobenzyloxy carbonyl, Dimethyl 
dimethoxybenzyloxy carbonyl, 5-Bromo-7-nitroindolinyl, o-Hydroxy-α-methyl 
cinnamoyl, and 2-Oxymethylene anthraquinone. Other examples of activators 
include ion beams, electric fields, magnetic fields, electron beams, x-ray, 
and the like. 

10. Predefined Region: A predefined region is a localized area on a surface 
which is, was, or is intended to be activated for formation of a polymer. The 
predefined region may have any convenient shape, e.g., circular, rectangular, 
elliptical, wedge-shaped, etc. For the sake of brevity herein, "predefined 
regions" are sometimes referred to simply as "regions." 

11. Substantially Pure: A polymer is considered to be "substantially pure" 
within a predefined region of a substrate when it exhibits characteristics 
that distinguish it from other predefined regions. Typically, purity will be 
measured in terms of biological activity or function as a result of uniform 
sequence. Such characteristics will typically be measured by way of binding 
with a selected ligand or receptor. 

II. General 

The present invention provides methods and apparatus for the preparation and 
use of a substrate having a plurality of polymer sequences in predefined 
regions. The invention is described herein primarily with regard to the 
preparation of molecules containing sequences of amino acids, but could 
readily be applied in the preparation of other polymers. Such polymers 
include, for example, both linear and cyclic polymers of nucleic acids, 
polysaccharides, phospholipids, and peptides having either α-, β-, or ω-amino 
acids, heteropolymers in which a known drug is covalently bound to any of the 
above, polyurethanes, [illegible], polyamides, [illegible] sulfides, 
polysiloxanes, polyimides, polyacetates, or other polymers which will be 
apparent upon review of this disclosure. In a preferred embodiment, the 
invention herein is used in the synthesis of peptides. 

The prepared substrate may, for example, be used in screening a variety of 
polymers as ligands for binding with a receptor, although it will be apparent 
that the invention could be used for the synthesis of a receptor for binding 
with a ligand. The substrate disclosed herein will have a wide variety of 
other uses. Merely by way of example, the invention herein can be used in 
determining peptide and nucleic acid sequences which bind to proteins, finding 
sequence-specific binding drugs, identifying epitopes recognized by 
antibodies, and evaluation of a variety of drugs for clinical and diagnostic 
applications, as well as combinations of the above. 

The invention preferably provides for the use of a substrate "S" with a 
surface. Linker molecules "L" are optionally provided on a surface of the 
substrate. The purpose of the linker molecules, in some embodiments, is to 
facilitate receptor recognition of the synthesized polymers. 

Optionally, the linker molecules may be chemically protected for storage 
purposes. A chemical storage protective group such as t-BOC 
(t-butoxycarbonyl) may be used in some embodiments. Such chemical protective 
groups would be chemically removed upon exposure to, for example, acidic 
solution and would serve to protect the surface during storage and be removed 
prior to polymer preparation. 

On the substrate or a distal end of the linker molecules, a functional group 
with a protective group P0 is provided. The protective group P0 may be removed 
upon exposure to radiation, electric fields, electric cur- 
rents, or other activators to expose the functional group. 

In a preferred embodiment, the radiation is ultraviolet (UV), infrared (IR), 
or visible light. As more fully described below, the protective group may 
alternatively be an electrochemically-sensitive group which may be removed in 
the presence of an electric field. In still further alternative embodiments, 
ion beams, electron beams, or the like may be used for deprotection. 

In some embodiments, the exposed regions and, therefore, the area upon which 
each distinct polymer sequence is synthesized are smaller than about 1 cm^2 or 
less than 1 mm^2. In preferred embodiments the exposed area is less than about 
10,000 µm^2 or, more preferably, less than 100 µm^2 and may, in some 
embodiments, encompass the binding site for as few as a single molecule. 
Within these regions, each polymer is preferably synthesized in a 
substantially pure form. 

Concurrently or after exposure of a known region of the substrate to light, 
the surface is contacted with a first monomer unit M1 which reacts with the 
functional group which has been exposed by the deprotection step. The first 
monomer includes a protective group P1. P1 may or may not be the same as P0. 

Accordingly, after a first cycle, known first regions of the surface may 
comprise the sequence: 

S-L-M1-P1 

while remaining regions of the surface comprise the sequence: 

S-L-P0. 

Thereafter, second regions of the surface (which may include the first region) 
are exposed to light and contacted with a second monomer M2 (which may or may 
not be the same as M1) having a protective group P2. P2 may or may not be the 
same as P0 and P1. After this second cycle, different regions of the substrate 
may comprise one or more of the following sequences: 

S-L-M1-M2-P2 
S-L-M2-P2 
S-L-M1-P1 and/or 
S-L-P0. 

The above process is repeated until the substrate includes desired polymers of 
desired lengths. By controlling the locations of the substrate exposed to 
light and [illegible]. 

Thereafter the protective groups are removed from [illegible] optionally, 
capped with a capping unit C. The process results in a substrate having a 
surface with a plurality of polymers of the following general formula: 

S-[L]-(Mi)-(Mj)-(Mk) . . . (Mx)-[C] 

where square brackets indicate optional groups, and Mi . . . Mx indicates any 
sequence of monomers. The number of monomers could cover a wide variety of 
values, but in a preferred embodiment they will range from 2 to 100. 

In some embodiments a plurality of locations on the substrate polymers are to 
contain a common monomer subsequence. For example, it may be desired to 
synthesize a sequence S-M1-M2-M3 at first locations and a sequence S-M4-M2-M3 
at second locations. The process would commence with irradiation of the first 
locations followed by contacting with M1-P, resulting in the sequence S-M1-P 
at the first location. The second locations would then be irradiated and 
contacted with M4-P, resulting in the sequence S-M4-P at the second locations. 
Thereafter both the first and second locations would be irradiated and 
contacted with the dimer M2-M3, resulting in the sequence S-M1-M2-M3 at the 
first locations and S-M4-M2-M3 at the second locations. Of course, common 
subsequences of any length could be utilized including those in a range of 2 
or more monomers, 2 to 100 monomers, 2 to 20 monomers, and a most preferred 
range of 2 to 3 monomers. 

According to other embodiments, a set of masks is used for the first monomer 
layer and, thereafter, varied light wavelengths are used for selective 
deprotection. For example, in the process discussed above, first regions are 
first exposed through a mask and reacted with a first monomer having a first 
protective group P1, which is removable upon exposure to a first wavelength of 
light (e.g., IR). Second regions are masked and reacted with a second monomer 
having a second protective group P2, which is removable upon exposure to a 
second wavelength of light (e.g., UV). Thereafter, masks become unnecessary in 
the synthesis because the entire substrate may be exposed alternatively to the 
first and second wavelengths of light in the deprotection cycle. 

The polymers prepared on a substrate according to the above methods will have 
a variety of uses including, for example, screening for biological activity. 
In such screening activities, the substrate containing the sequences is 
exposed to an unlabeled or labeled receptor such as an antibody, receptor on a 
cell, phospholipid vesicle, or any one of a variety of other receptors. In one 
preferred embodiment the polymers are exposed to a first, unlabeled receptor 
of interest and, thereafter, exposed to a labeled receptor-specific 
recognition element, which is, for example, an antibody. This process will 
provide signal amplification in the detection stage. 

The receptor molecules may bind with one or more polymers on the substrate. 
The presence of the labeled receptor and, therefore, the presence of a 
sequence which binds with the receptor is detected in a preferred embodiment 
through the use of autoradiography, detection of fluorescence with a 
charge-coupled device, fluorescence microscopy, or the like. The sequence of 
the polymer at the locations where the receptor binding is [illegible] with 
reference to screening for biological activity. The invention will, however, 
find many other uses. For example, the invention may be used in information 
storage [illegible], production of molecular electronic devices, production of 
stationary phases in separation sciences, production of dyes and brightening 
agents, photography, and in immobilization of cells, proteins, lectins, 
nucleic acids, polysaccharides and the like in patterns on a surface via 
molecular recognition of specific polymer sequences. By synthesizing the same 
compound in adjacent, progressively differing concentrations, a gradient will 
be established to control chemotaxis or to develop diagnostic dipsticks which, 
for example, titrate an antibody against an increasing amount of antigen. By 
synthesizing several catalyst molecules in close proximity, more efficient 
multistep conversions may be achieved by "coordinate immobilization." Co- 
ordinate immobilization also may be used for electron transfer systems, as 
well as to provide both structural integrity and other desirable properties to 
materials such as lubrication, wetting, etc. 

According to alternative embodiments, molecular biodistribution or 
pharmacokinetic properties may be examined. For example, to assess resistance 
to intestinal or serum proteases, polymers may be capped with a fluorescent 
tag and exposed to biological fluids of interest. 

III. Polymer Synthesis 

FIG. 1 illustrates one embodiment of the invention disclosed herein in which a 
substrate 2 is shown in cross-section. Essentially, any conceivable substrate 
may be employed in the invention. The substrate may be biological, 
nonbiological, organic, inorganic, or a combination of any of these, existing 
as particles, strands, precipitates, gels, sheets, tubing, spheres, 
containers, capillaries, pads, slices, films, plates, slides, etc. The 
substrate may have any convenient shape, such as a disc, square, sphere, 
circle, etc. The substrate is preferably flat but may take on a variety of 
alternative surface configurations. For example, the substrate may contain 
raised or depressed regions on which the synthesis takes place. The substrate 
and its surface preferably form a rigid support on which to carry out the 
reactions described herein. The substrate and its surface is also chosen to 
provide appropriate light-absorbing characteristics. For instance, the 
substrate may be a polymerized Langmuir Blodgett film, functionalized glass, 
Si, Ge, GaAs, GaP, SiO2, SiN4, modified silicon, or any one of a wide variety 
of gels or polymers such as (poly)tetrafluoroethylene, 
(poly)vinylidenedifluoride, polystyrene, polycarbonate, or combinations 
thereof. Other substrate materials will be readily apparent to those of skill 
in the art upon review of this disclosure. In a preferred embodiment the 
substrate is flat glass or single- 

completed substrate to interact freely with molecules exposed to the 
substrate. The linker molecules should be 6-50 atoms long to provide 
sufficient exposure. The linker molecules may be, for example, aryl acetylene, 
ethylene glycol oligomers containing 2-10 monomer units, diamines, diacids, 
amino acids, or combinations thereof. Other linker molecules may be used in 
light of this disclosure. 

According to alternative embodiments, the linker molecules are selected based 
upon their hydrophilic/hydrophobic properties to improve presentation of 
synthesized polymers to certain receptors. For example, in the case of a 
hydrophilic receptor, hydrophilic linker molecules will be preferred so as to 
permit the receptor to more closely approach the synthesized polymer. 

According to another alternative embodiment, linker molecules are also 
provided with a photocleavable group at an intermediate position. The 
photocleavable group is preferably cleavable at a wavelength different from 
the protective group. This enables removal of the various polymers following 
completion of the synthesis by way of exposure to the different wavelengths of 
light. 

The linker molecules can be attached to the substrate via carbon-carbon bonds 
using, for example, (poly)trifluorochloroethylene surfaces, or preferably, by 
siloxane bonds (using, for example, glass or silicon oxide surfaces). Siloxane 
bonds with the surface of the substrate may be formed in one embodiment via 
reactions of linker molecules bearing trichlorosilyl groups. The linker 
molecules may optionally be attached in an ordered array, i.e., as parts of 
the head groups in a polymerized Langmuir Blodgett film. In alternative 
embodiments, the linker molecules are adsorbed to the surface of the 
substrate. 

The linker molecules and monomers used herein are 
crystal silicon with surface relief features of less than 10 provided with a functioned group to which is bound a 
j£ protective group. Preferably, the protective group is on 

According to some embodiments, the surface of the 40 the distal or terminal end of the linker molecule oppo- 

substrate is etched using well known techniques to pro- site the substrate. The protective group may be either a 

vide for desired surface features. For example, by way negative protective group (Le., the protective group 

of the formation of trenches, v-grooves, mesa struc- renders the linker molecules less reactive with a mono- 

tures, or the like, the synthesis regions may be more nier upon exposure) or a positive protective group (Le., 

closely placed within the focus point of impinging light, 45 the protective group renders the linker molecules more 

be provided with reflective "mirror" structures for reactive with a monomer upon exposure). In the case of 

maximization of light collection from fluorescent negative protective groups an additional step of reacti- 

sources, or the like. vation will be required. In some embodiments, this will 

Surfaces on the solid substrate will usually, though be done by heating, 

not always, be composed of the same material as the 50 The protective group on the linker molecules may be 

substrate. Thus, the surface may be composed of any of selected from a wide variety of positive light-reactive 

a wide variety of materials, for example, polymers, groups preferably including nitro aromatic compounds 

plastics, resins, polysaccharides, silica or silica-based such as o-nhrobenzyl derivatives or benzylsulfonyl. In a 

materials, carbon, metals, inorganic glasses, membranes, preferred embodiment, 6-nitroveratryloxycarbonyl 

or any of the above-listed substrate materials. In some 55 (NVOC), 2-nitrobenzylpxycarbonyl (NBOC) or a,a- 
embodiments the surface may provide for the use of dimethyl-dimethoxybenzyloxycarbonyl (DDZ) is used, 

caged binding members which are attached firmly to In one embodiment; a nitro aromatic compound con- 

the surface of the substrate. Preferably, the surface will taining a benzylic hydrogen ortho to the nitro group is 

contain reactive groups, which could be carboxyl, used, Le., a chemical of the form: 

amino, hydroxyl, or the like. Most preferably, the sur- 60 
face will be optically transparent and will have surface 
Si— OH functionalities, such as are found on silica sur- 
faces. 

The surface 4 of the substrate is preferably provided 

with a layer of linker molecules 6, although it will be 65 
understood that the linker molecules are not required 
elements of the invention. The linker molecules are 
preferably of sufficient length to permit polymers in a 
where R1 is alkoxy, alkyl, halo, aryl, alkenyl, or hydrogen; R2 is alkoxy, alkyl, halo, aryl, nitro, or hydrogen; R3 is alkoxy, alkyl, halo, nitro, aryl, or hydrogen; R4 is alkoxy, alkyl, hydrogen, aryl, halo, or nitro; and R5 is alkyl, alkynyl, cyano, alkoxy, hydrogen, halo, aryl, or alkenyl. Other materials which may be used include o-hydroxy-α-methyl cinnamoyl derivatives. Photoremovable protective groups are described in, for example, Patchornik, J. Am. Chem. Soc. (1970) 92:6333 and Amit et al., J. Org. Chem. (1974) 39:192, both of which are incorporated herein by reference.

In an alternative embodiment the positive reactive group is activated for reaction with reagents in solution. For example, a 5-bromo-7-nitro indoline group, when bound to a carbonyl, undergoes reaction upon exposure to light at 420 nm.

In a second alternative embodiment, the reactive group on the linker molecule is selected from a wide variety of negative light-reactive groups including a cinnamate group.

Alternatively, the reactive group is activated or deactivated by electron beam lithography, x-ray lithography, or any other radiation. Suitable reactive groups for electron beam lithography include sulfonyl. Other methods may be used including, for example, exposure to a current source. Other reactive groups and methods of activation may be used in light of this disclosure.

As shown in FIG. 1, the linking molecules are preferably exposed to, for example, light through a suitable mask 8 using photolithographic techniques of the type known in the semiconductor industry and described in, for example, Sze, VLSI Technology, McGraw-Hill (1983), and Mead et al., Introduction to VLSI Systems, Addison-Wesley (1980), which are incorporated herein by reference for all purposes. The light may be directed at either the surface containing the protective groups or at the back of the substrate, so long as the substrate is transparent to the wavelength of light needed for removal of the protective groups. In the embodiment shown in FIG. 1, light is directed at the surface of the substrate containing the protective groups. FIG. 1 illustrates the use of such masking techniques as they are applied to a positive reactive group so as to activate linking molecules and expose functional groups in areas 10a and 10b.

The mask 8 is in one embodiment a transparent support material selectively coated with a layer of opaque material. Portions of the opaque material are removed, leaving opaque material in the precise pattern desired on the substrate surface. The mask is brought into close proximity with, imaged on, or brought directly into contact with the substrate surface as shown in FIG. 1. "Openings" in the mask correspond to locations on the substrate where it is desired to remove photoremovable protective groups from the substrate. Alignment may be performed using conventional alignment techniques in which alignment marks (not shown) are used to accurately overlay successive masks with previous patterning steps, or more sophisticated techniques may be used. For example, interferometric techniques such as the one described in Flanders et al., "A New Interferometric Alignment Technique," App. Phys. Lett. (1977) 31:426-428, which is incorporated herein by reference, may be used.

To enhance contrast of light applied to the substrate, it is desirable to provide contrast enhancement materials between the mask and the substrate according to some embodiments. This contrast enhancement layer may comprise a molecule which is decomposed by light such as quinone diazide or a material which is transiently bleached at the wavelength of interest. Transient bleaching of materials will allow greater penetration where light is applied, thereby enhancing contrast. Alternatively, contrast enhancement may be provided by way of a cladded fiber optic bundle.

The light may be from a conventional incandescent source, a laser, a laser diode, or the like. If non-collimated sources of light are used it may be desirable to provide a thick- or multi-layered mask to prevent spreading of the light onto the substrate. It may, further, be desirable in some embodiments to utilize groups which are sensitive to different wavelengths to control synthesis. For example, by using groups which are sensitive to different wavelengths, it is possible to select branch positions in the synthesis of a polymer or eliminate certain masking steps. Several reactive groups along with their corresponding wavelengths for deprotection are provided in Table 1.

TABLE 1

Group                                  Approximate Deprotection Wavelength
Nitroveratryloxy carbonyl (NVOC)       UV (300-400 nm)
[two rows illegible in the source]
5-Bromo-7-nitroindolinyl               UV (420 nm)
o-Hydroxy-α-methyl cinnamoyl           UV (300-350 nm)
2-Oxymethylene anthraquinone           UV (350 nm)

While the invention is illustrated primarily herein by way of the use of a mask to illuminate selected regions of the substrate, other techniques may also be used. For example, the substrate may be translated under a modulated laser or diode light source. Such techniques are discussed in, for example, U.S. Pat. No. 4,719,615 (Feyrer et al.), which is incorporated herein by reference. In alternative embodiments a laser galvanometric scanner is utilized. In other embodiments, the synthesis may take place on or in contact with a conventional liquid crystal (referred to herein as a "light valve") or fiber optic light sources. By appropriately modulating liquid crystals, light may be selectively controlled so as to permit light to contact selected regions of the substrate. Alternatively, synthesis may take place on the end of a series of optical fibers to which light is selectively applied. Other means of controlling the location of light exposure will be apparent to those of skill in the art.

The substrate may be irradiated either in contact or not in contact with a solution (not shown) and is, preferably, irradiated in contact with a solution. The solution contains reagents to prevent the by-products formed by irradiation from interfering with synthesis of the polymer according to some embodiments. Such by-products might include, for example, carbon dioxide, nitrosocarbonyl compounds, styrene derivatives, indole derivatives, and products of their photochemical reactions. Alternatively, the solution may contain reagents used to match the index of refraction of the substrate. Reagents added to the solution may further include, for example, acidic or basic buffers, thiols, substituted hydrazines and hydroxylamines, reducing agents (e.g., NADH) or reagents known to react with a given functional group (e.g., aryl nitroso + glyoxylic acid → aryl formhydroxamate + CO2).
Either concurrently with or after the irradiation step, the linker molecules are washed or otherwise contacted with a first monomer, illustrated by "A" in regions 12a and 12b in FIG. 2. The first monomer reacts with the activated functional groups of the linkage molecules which have been exposed to light. The first monomer, which is preferably an amino acid, is also provided with a photoprotective group. The photoprotective group on the monomer may be the same as or different than the protective group used in the linkage molecules, and may be selected from any of the above-described protective groups. In one embodiment, the protective group for the A monomer is selected from the group NBOC and NVOC.

As shown in FIG. 3, the process of irradiating is thereafter repeated, with a mask repositioned so as to remove linkage protective groups and expose functional groups in regions 14a and 14b which are illustrated as being regions which were protected in the previous masking step. As an alternative to repositioning of the first mask, in many embodiments a second mask will be utilized. In other alternative embodiments, some steps may provide for illuminating a common region in successive steps. As shown in FIG. 3, it may be desirable to provide separation between irradiated regions. For example, separation of about 1-5 µm may be appropriate to account for alignment tolerances.

As shown in FIG. 4, the substrate is then exposed to a second protected monomer "B," producing B regions 16a and 16b. Thereafter, the substrate is again masked so as to remove the protective groups and expose reactive groups on A region 12a and B region 16b. The substrate is again exposed to monomer B, resulting in the formation of the structure shown in FIG. 6. The dimers B-A and B-B have been produced on the substrate.

A subsequent series of masking and contacting steps similar to those described above with A (not shown) provides the structure shown in FIG. 7. The process provides all possible dimers of B and A, i.e., B-A, A-B, A-A, and B-B.

The substrate, the area of synthesis, and the area for synthesis of each individual polymer could be of any size or shape. For example, squares, ellipsoids, rectangles, triangles, circles, or portions thereof, along with irregular geometric shapes, may be utilized. Duplicate synthesis areas may also be applied to a single substrate for purposes of redundancy.

In one embodiment the regions 12a, 12b and 16a, 16b on the substrate will have a surface area of between about 1 cm² and 10⁻¹⁰ cm². In some embodiments the regions 12a, 12b and 16a, 16b have areas of less than about 10⁻¹ cm², 10⁻² cm², 10⁻³ cm², 10⁻⁴ cm², 10⁻⁵ cm², 10⁻⁶ cm², 10⁻⁷ cm², 10⁻⁸ cm², or 10⁻¹⁰ cm². In a preferred embodiment, the regions 12a, 12b and 16a, 16b are between about 10×10 µm and 500×500 µm.

In some embodiments a single substrate supports more than about 10 different monomer sequences and preferably more than about 100 different monomer sequences, although in some embodiments more than about 10³, 10⁴, 10⁵, 10⁶, 10⁷, or 10⁸ different sequences are provided on a substrate. Of course, within a region of the substrate in which a monomer sequence is synthesized, it is preferred that the monomer sequence be substantially pure. In some embodiments, regions of the substrate contain polymer sequences which are at least about 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, or 99% pure.

According to some embodiments, several sequences are intentionally provided within a single region so as to provide an initial screening for biological activity, after which materials within regions exhibiting significant binding are further evaluated.

IV. Details of One Embodiment of a Reactor System

FIG. 8A schematically illustrates a preferred embodiment of a reactor system 100 for synthesizing polymers on the prepared substrate in accordance with one aspect of the invention. The reactor system includes a body 102 with a cavity 104 on a surface thereof. In preferred embodiments the cavity 104 is between about 50 and 1000 µm deep with a depth of about 500 µm preferred.

The bottom of the cavity is preferably provided with an array of ridges 106 which extend both into the plane of the Figure and parallel to the plane of the Figure. The ridges are preferably about 50 to 200 µm deep and spaced at about 2 to 3 mm. The purpose of the ridges is to generate turbulent flow for better mixing. The bottom surface of the cavity is preferably light absorbing so as to prevent reflection of impinging light.

A substrate 112 is mounted above the cavity 104. The substrate is provided along its bottom surface 114 with a photoremovable protective group such as NVOC with or without an intervening linker molecule. The substrate is preferably transparent to a wide spectrum of light, but in some embodiments is transparent only at a wavelength at which the protective group may be removed (such as UV in the case of NVOC). The substrate in some embodiments is a conventional microscope glass slide or cover slip. The substrate is preferably as thin as possible, while still providing adequate physical support. Preferably, the substrate is less than about 1 mm thick, more preferably less than 0.5 mm thick, more preferably less than 0.1 mm thick, and most preferably less than 0.05 mm thick. In alternative preferred embodiments, the substrate is quartz or silicon.

The substrate and the body serve to seal the cavity except for an inlet port 108 and an outlet port 110. The body and the substrate may be mated for sealing in some embodiments with one or more gaskets. According to a preferred embodiment, the body is provided with two concentric gaskets and the intervening space is held at vacuum to ensure mating of the substrate to the gaskets.

Fluid is pumped through the inlet port into the cavity by way of a pump 116 which may be, for example, a model no. B-120-S made by Eldex Laboratories. Selected fluids are circulated into the cavity by the pump, through the cavity, and out the outlet for recirculation or disposal. The reactor may be subjected to ultrasonic radiation and/or heated to aid in agitation in some embodiments.

Above the substrate 112, a lens 120 is provided which may be, for example, a 2" 100 mm focal length fused silica lens. For the sake of a compact system, a reflective mirror 122 may be provided for directing light from a light source 124 onto the substrate. Light source 124 may be, for example, a Xe(Hg) light source manufactured by Oriel and having model no. 66024. A second lens 126 may be provided for the purpose of projecting a mask image onto the substrate in combination with lens 120. This form of lithography is referred to herein as projection printing. As will be apparent from this disclosure, proximity printing and the like may also be used according to some embodiments.

Light from the light source is permitted to reach only selected locations on the substrate as a result of mask 128. Mask 128 may be, for example, a glass slide having
etched chrome thereon. The mask 128 in one embodi-
ment is provided with a grid of transparent locations
and opaque locations. Such masks may be manufactured
by, for example, Photo Sciences, Inc. Light passes
freely through the transparent regions of the mask, but
is reflected from or absorbed by other regions. There-
fore, only selected regions of the substrate are exposed
to light.

As discussed above, light valves (LCD's) may be
used as an alternative to conventional masks to selec-
tively expose regions of the substrate. Alternatively,
fiber optic faceplates such as those available from
Schott Glass, Inc., may be used for the purpose of con-
trast enhancement of the mask or as the sole means of
restricting the region to which light is applied. Such
faceplates would be placed directly above or on the
substrate in the reactor shown in FIG. 8A. In still fur-
ther embodiments, fly's-eye lenses, tapered fiber optic
faceplates, or the like, may be used for contrast en-
hancement.

In order to provide for illumination of regions smaller
than a wavelength of light, more elaborate techniques
may be utilized. For example, according to one pre-
ferred embodiment, light is directed at the substrate by
way of molecular microcrystals on the tip of, for exam-
ple, micropipettes. Such devices are disclosed in Lieber-
man et al., "A Light Source Smaller Than the Optical
Wavelength," Science (1990) 247:59-61, which is incor-
porated herein by reference for all purposes.

In operation, the substrate is placed on the cavity and
sealed thereto. All operations in the process of prepar-
ing the substrate are carried out in a room lit primarily
or entirely by light of a wavelength outside of the light
range at which the protective group is removed. For
example, in the case of NVOC, the room should be lit
with a conventional dark room light which provides
little or no UV light. All operations are preferably con-
ducted at about room temperature.

A first, deprotection fluid (without a monomer) is
circulated through the cavity. The solution is preferably
5 mM sulfuric acid in dioxane, which serves
to keep exposed amino groups protonated and decreases
their reactivity with photolysis by-products. Absorp-
tive materials such as N,N-diethylamino 2,4-dinitroben-
zene, for example, may be included in the deprotection
fluid; these serve to absorb light and prevent reflection
and unwanted photolysis.

The slide is, thereafter, positioned in a light raypath
from the mask such that first locations on the substrate
are illuminated and, therefore, deprotected. In pre-
ferred embodiments the substrate is illuminated for be-
tween about 1 and 15 minutes with a preferred illumina-
tion time of about 10 minutes at 10-20 mW/cm² with
365 nm light. The slides are neutralized (i.e., brought to
a pH of about 7) after photolysis with, for example, a
solution of di-isopropylethylamine (DIEA) in methy-
lene chloride for about 5 minutes.
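
As a rough illustration of the exposure figures above, the energy delivered per unit area is the irradiance multiplied by the exposure time. The following Python sketch works this out for the preferred 10 minute exposure at 10-20 mW/cm²; the function name is hypothetical and is not part of the disclosure.

```python
def exposure_dose_j_per_cm2(irradiance_mw_per_cm2: float, minutes: float) -> float:
    """Energy per unit area (J/cm^2) = irradiance (W/cm^2) x time (s)."""
    watts_per_cm2 = irradiance_mw_per_cm2 / 1000.0
    seconds = minutes * 60.0
    return watts_per_cm2 * seconds


# Preferred conditions stated in the text: about 10 minutes at 10-20 mW/cm^2.
low_dose = exposure_dose_j_per_cm2(10, 10)
high_dose = exposure_dose_j_per_cm2(20, 10)
print(f"Dose range: {low_dose:.1f} to {high_dose:.1f} J/cm^2")
```

Under these figures the dose falls between roughly 6 and 12 J/cm²; the shorter 1 minute exposure mentioned in the text would scale this down tenfold.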

The first monomer is then placed at the first locations
on the substrate. After irradiation, the slide is removed,
treated in bulk, and then reinstalled in the flow cell.
Alternatively, a fluid containing the first monomer,
preferably also protected by a protective group, is cir-
culated through the cavity by way of pump 116. If, for
example, it is desired to attach the amino acid Y to the
substrate at the first locations, the amino acid Y (bearing
a protective group on its α-nitrogen), along with rea-
gents used to render the monomer reactive, and/or a
carrier, is circulated from a storage container 118,
through the pump, through the cavity, and back to the
inlet of the pump.

The monomer carrier solution is, in a preferred em- 
bodiment, formed by mixing of a first solution (referred 
to herein as solution "A") and a second solution (re- 
ferred to herein as solution "B"). Table 2 provides an 
illustration of a mixture which may be used for solution 
A. 

TABLE 2

Representative Monomer Carrier Solution "A"

100 mg   NVOC amino protected amino acid
 37 mg   HOBT (1-Hydroxybenzotriazole)
250 µl   DMF (Dimethylformamide)
 86 µl   DIEA (Diisopropylethylamine)



The composition of solution B is illustrated in Table
3. Solutions A and B are mixed and allowed to react at
room temperature for about 8 minutes, then diluted
with 2 ml of DMF, and 500 µl are applied to the surface
of the slide or the solution is circulated through the
reactor system and allowed to react for about 2 hours at
room temperature. The slide is then washed with DMF,
methylene chloride and ethanol.

TABLE 3

Representative Monomer Carrier Solution "B"

250 µl   DMF
111 mg   BOP (Benzotriazolyl-n-oxy-tris(dimethylamino)phosphoniumhexafluorophosphate)
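
The quantities in Tables 2 and 3 can be checked with simple arithmetic. The Python sketch below is only an illustration: it assumes liquid volumes are additive and ignores the volume of dissolved solids, and the molar masses for HOBT and BOP are standard reference values that are not given in the source.

```python
# Volumes (in microliters) taken from Tables 2 and 3 and the mixing step in the text.
SOLUTION_A_UL = 250 + 86      # DMF + DIEA in solution "A"
SOLUTION_B_UL = 250           # DMF in solution "B"
DILUENT_UL = 2000             # 2 ml of DMF added after about 8 minutes
APPLIED_UL = 500              # volume applied to the slide surface
AMINO_ACID_MG = 100           # NVOC-protected amino acid in solution "A"

# Assumed molar masses in g/mol (reference values, not stated in the source).
HOBT_G_PER_MOL = 135.12
BOP_G_PER_MOL = 442.28


def millimoles(mass_mg: float, molar_mass_g_per_mol: float) -> float:
    """Convert a mass in mg to mmol (mg divided by g/mol gives mmol)."""
    return mass_mg / molar_mass_g_per_mol


total_ul = SOLUTION_A_UL + SOLUTION_B_UL + DILUENT_UL
applied_fraction = APPLIED_UL / total_ul
amino_acid_applied_mg = AMINO_ACID_MG * applied_fraction

print(f"Total diluted volume: {total_ul} ul")
print(f"Amino acid in the 500 ul aliquot: {amino_acid_applied_mg:.1f} mg")
print(f"HOBT: {millimoles(37, HOBT_G_PER_MOL):.3f} mmol")
print(f"BOP:  {millimoles(111, BOP_G_PER_MOL):.3f} mmol")
```

On these assumptions the two activating reagents are present in roughly equal molar amounts, and about one fifth of the protected amino acid is delivered in the 500 µl aliquot.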



As the solution containing the monomer to be at- 
tached is circulated through the cavity, the amino acid 
or other monomer will react at its carboxy terminus 
with amino groups on the regions of the substrate which 
have been deprotected. Of course, while the invention 
is illustrated by way of circulation of the monomer 
through the cavity, the invention could be practiced by 
way of removing the slide from the reactor and sub- 
mersing it in an appropriate monomer solution. 

After addition of the first monomer, the solution
containing the first amino acid is then purged from the
system. After circulation of a sufficient amount of the
DMF/methylene chloride such that removal of the
amino acid can be assured (e.g., about 50 times the
volume of the cavity and carrier lines), the mask or
substrate is repositioned, or a new mask is utilized such
that second regions on the substrate will be exposed to
light and the light source 124 is engaged for a second exposure.
This will deprotect second regions on the substrate and
the process is repeated until the desired polymer se-
quences have been synthesized.

The entire derivatized substrate is then exposed to a 
receptor of interest, preferably labeled with, for exam- 
ple, a fluorescent marker, by circulation of a solution or 
suspension of the receptor through the cavity or by 
contacting the surface of the slide in bulk. The receptor 
will preferentially bind to certain regions of the sub- 
strate which contain complementary sequences. 

Antibodies are typically suspended in what is com-
monly referred to as "supercocktail," which may be, for
example, a solution of about 1% BSA (bovine serum
albumin), 0.5% Tween™ non-ionic detergent in PBS
(phosphate buffered saline) buffer. The antibodies are
diluted into the supercocktail buffer to a final concen-
tration of, for example, about 0.1 to 4 µg/ml.
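
The dilution step above follows the usual relation C1·V1 = C2·V2. The Python sketch below shows the calculation for the stated 0.1 to 4 µg/ml range; the 1000 µg/ml stock concentration and 1 ml final volume are hypothetical values chosen only for illustration, since the source does not give them.

```python
def stock_volume_ul(stock_ug_per_ml: float, target_ug_per_ml: float,
                    final_volume_ml: float) -> float:
    """Volume of antibody stock (in microliters) needed so that C1*V1 = C2*V2."""
    if target_ug_per_ml > stock_ug_per_ml:
        raise ValueError("target concentration exceeds stock concentration")
    stock_volume_ml = target_ug_per_ml * final_volume_ml / stock_ug_per_ml
    return stock_volume_ml * 1000.0


# Hypothetical stock of 1000 ug/ml diluted into 1 ml of supercocktail buffer.
for target in (0.1, 4.0):
    volume = stock_volume_ul(1000.0, target, 1.0)
    print(f"{target} ug/ml final: add {volume:.2f} ul of stock")
```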

FIG. 8B illustrates an alternative preferred embodi- 
ment of the reactor shown in FIG. 8A. According to 
this embodiment, the mask 128 is placed directly in contact with the substrate. Preferably, the etched portion of the mask is placed face down so as to reduce the effects of light dispersion. According to this embodiment, the imaging lenses 120 and 126 are not necessary because the mask is brought into close proximity with the substrate.

For purposes of increasing the signal-to-noise ratio of the technique, some embodiments of the invention provide for exposure of the substrate to a first labeled or unlabeled receptor followed by exposure of a labeled, second receptor (e.g., an antibody) which binds at multiple sites on the first receptor. If, for example, the first receptor is an antibody derived from a first species of an animal, the second receptor is an antibody derived from a second species directed to epitopes associated with the first species. In the case of a mouse antibody, for example, fluorescently labeled goat antibody or antiserum which is antimouse may be used to bind at multiple sites on the mouse antibody, providing several times the fluorescence compared to the attachment of a single mouse antibody at each binding site. This process may be repeated again with additional antibodies (e.g., goat-mouse-goat, etc.) for further signal amplification.

In preferred embodiments an ordered sequence of masks is utilized. In some embodiments it is possible to use as few as a single mask to synthesize all of the possible polymers of a given monomer set.

If, for example, it is desired to synthesize all 16 dinucleotides from four bases, a 1 cm square synthesis region is divided conceptually into 16 boxes, each 0.25 cm wide. Denote the four monomer units by A, B, C, and D. The first reactions are carried out in four vertical columns, each 0.25 cm wide. The first mask exposes the left-most column of boxes, where A is coupled. The second mask exposes the next column, where B is coupled; followed by a third mask, for the C column; and a final mask that exposes the right-most column, for D. The first, second, third, and fourth masks may be a single mask translated to different locations.

The process is repeated in the horizontal direction for the second unit of the dimer. This time, the masks allow exposure of horizontal rows, again 0.25 cm wide. A, B, C, and D are sequentially coupled using masks that expose horizontal fourths of the reaction area. The resulting substrate contains all 16 dinucleotides of four bases.

The eight masks used to synthesize the dinucleotide are related to one another by translation or rotation. In fact, one mask can be used in all eight steps if it is suitably rotated and translated. For example, in the example above, a mask with a single transparent region could be sequentially used to expose each of the vertical columns, rotated 90°, and then sequentially used to allow exposure of the horizontal rows.

Tables 4 and 5 provide a simple computer program in Quick Basic for planning a masking program and a sample output, respectively, for the synthesis of a polymer chain of three monomers ("residues") having three different monomers in the first level, four different monomers in the second level, and five different monomers in the third level in a striped pattern. The output of the program is the number of cells, the number of "stripes" (light regions) on each mask, and the amount of translation required for each exposure of the mask.

TABLE 4

Mask Strategy Program

    DEFINT A-Z
    DIM b(20), w(20), l(500), lmax(20)
    f$ = "LPT1:"
    OPEN f$ FOR OUTPUT AS #1
    jmax = 3                        'Number of residues
    b(1) = 3: b(2) = 4: b(3) = 5    'Number of building blocks for res 1,2,3
    g = 1: lmax(1) = 1
    FOR j = 1 TO jmax: g = g * b(j): NEXT j    'g = total number of cells
    w(0) = 0: w(1) = g / b(1)                  'Stripe width for residue 1
    PRINT #1, "MASK2.BAS ", DATE$, TIME$: PRINT #1,
    PRINT #1, USING "Number of residues =##"; jmax
    FOR j = 1 TO jmax
      PRINT #1, USING "  Residue ##  ## building blocks"; j; b(j)
    NEXT j
    PRINT #1,
    PRINT #1, USING "Number of cells=####"; g: PRINT #1,
    FOR j = 2 TO jmax
      lmax(j) = lmax(j - 1) * b(j - 1)         'Stripes on mask j
      w(j) = w(j - 1) / b(j)                   'Stripe width on mask j
    NEXT j
    FOR j = 1 TO jmax
      PRINT #1, USING "Mask for residue ##"; j: PRINT #1,
      PRINT #1, USING "  Number of stripes=###"; lmax(j)
      PRINT #1, USING "  Width of each stripe=###"; w(j)
      FOR l = 1 TO lmax(j)
        a = 1 + (l - 1) * w(j - 1)             'First cell of stripe l
        ae = a + w(j) - 1                      'Last cell of stripe l
        PRINT #1, USING "  Stripe ## begins at location ### and ends at ###"; l; a; ae
      NEXT l
      PRINT #1,
      PRINT #1, USING "  For each of ## building blocks, translate mask by ## cell(s)"; b(j); w(j)
      PRINT #1, : PRINT #1, : PRINT #1,
    NEXT j

TABLE 5

Masking Strategy Output

  Residue 1   3 building blocks
  Residue 2   4 building blocks
  Residue 3   5 building blocks
Number of cells = 60

Mask for residue 1
  Number of stripes = 1
  Width of each stripe = 20
  Stripe 1 begins at location 1 and ends at 20
  For each of 3 building blocks, translate mask by 20 cell(s)

Mask for residue 2
  Number of stripes = 3
  Width of each stripe = 5
  Stripe 1 begins at location 1 and ends at 5
  Stripe 2 begins at location 21 and ends at 25
  Stripe 3 begins at location 41 and ends at 45
  For each of 4 building blocks, translate mask by 5 cell(s)

Mask for residue 3
  Number of stripes = 12
  Width of each stripe = 1
  Stripe 1 begins at location 1 and ends at 1
  Stripe 2 begins at location 6 and ends at 6
  Stripe 3 begins at location 11 and ends at 11
  Stripe 4 begins at location 16 and ends at 16
  Stripe 5 begins at location 21 and ends at 21
  Stripe 6 begins at location 26 and ends at 26
  Stripe 7 begins at location 31 and ends at 31
  Stripe 8 begins at location 36 and ends at 36
  Stripe 9 begins at location 41 and ends at 41
  Stripe 10 begins at location 46 and ends at 46
  Stripe 11 begins at location 51 and ends at 51
  Stripe 12 begins at location 56 and ends at 56
  For each of 5 building blocks, translate mask by 1 cell(s)

© Copyright 1990, Affymax Research Institute
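
For readers without a Quick Basic interpreter, the striped-mask planning logic of Table 4 can be sketched in Python. This is an illustrative port, not the program of the disclosure: the function name `plan_masks` and the dictionary layout are assumptions, and the sketch returns the plan as data rather than printing to a printer port. It is intended to reproduce the stripe positions listed in Table 5.

```python
def plan_masks(building_blocks):
    """Plan one striped mask per residue.

    building_blocks[j] is the number of different monomers at residue j+1.
    Returns (total number of cells, list of per-residue mask plans).
    """
    cells = 1
    for count in building_blocks:
        cells *= count

    plans = []
    stripes = 1        # the first mask has a single stripe
    pitch = cells      # spacing between stripe starts (width of the previous level)
    for residue, count in enumerate(building_blocks, start=1):
        width = pitch // count
        starts = [1 + k * pitch for k in range(stripes)]
        plans.append({
            "residue": residue,
            "stripes": stripes,
            "width": width,
            "ranges": [(s, s + width - 1) for s in starts],
            "exposures": count,       # one exposure per building block
            "translate_by": width,    # cells to shift the mask between exposures
        })
        stripes *= count
        pitch = width
    return cells, plans


cells, plans = plan_masks([3, 4, 5])
print("Number of cells =", cells)
for plan in plans:
    print(f"Mask for residue {plan['residue']}: {plan['stripes']} stripe(s), "
          f"width {plan['width']}, translate by {plan['translate_by']} cell(s)")
```

Each mask is exposed once per building block and shifted by one stripe width between exposures, so the three masks together address every one of the 60 cells with a distinct three-residue sequence.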

V. Details of One Embodiment of A Fluorescent De- 
tection Device 

FIG. 9 illustrates a fluorescent detection device for 
detecting fluorescently labeled receptors on a substrate.
A substrate 112 is placed on an x/y translation table 202.
In a preferred embodiment the x/y translation table is a
model no. PM500-A1 manufactured by Newport Cor- 
poration. The x/y translation table is connected to and 
controlled by an appropriately programmed digital 
computer 204 which may be, for example, an appropri- 
ately programmed IBM PC/AT or AT compatible 
computer. Of course, other computer systems, special 
purpose hardware, or the like could readily be substi- 
tuted for the AT computer used herein for illustration. 
Computer software for the translation and data collec-
tion functions described herein can be provided based
on commercially available software including, for ex- 
ample, "Lab Windows" licensed by National Instru- 
ments, which is incorporated herein by reference for all 
purposes. 

The substrate and x/y translation table are placed
under a microscope 206 which includes one or more 
objectives 208. Light (about 488 nm) from a laser 210, 
which in some embodiments is a model no. 2020-05 
argon ion laser manufactured by Spectraphysics, is di-
rected at the substrate by a dichroic mirror 207 which
passes greater than about 520 nm light but reflects 488 
nm light. Dichroic mirror 207 may be, for example, a 
model no. FT510 manufactured by Carl Zeiss. Light 
reflected from the mirror then enters the microscope 
206 which may be, for example, a model no. Axioscop
20 manufactured by Carl Zeiss. Fluorescein-marked 
materials on the substrate will fluoresce >488 nm light, 
and the fluoresced light will be collected by the micro- 
scope and passed through the mirror. The fluorescent 
light from the substrate is then directed through a wave-
length filter 209 and, thereafter, through an aperture
plate 211. Wavelength filter 209 may be, for example, a 
model no. OG530 manufactured by Melles Griot and
aperture plate 211 may be, for example, a model no.
477352/477380 manufactured by Carl Zeiss. 

The fluoresced light then enters a photomultiplier 
tube 212 which in some embodiments is a model no. 
R943-02 manufactured by Hamamatsu; the signal is
amplified in preamplifier 214 and photons are counted 
by photon counter 216. The number of photons is re- 
corded as a function of the location in the computer 204. 
Pre-Amp 214 may be, for example, a model no. SR440 
manufactured by Stanford Research Systems and pho- 
ton counter 216 may be a model no. SR400 manufac- 
tured by Stanford Research Systems. The substrate is 
then moved to a subsequent location and the process is 
repeated. In preferred embodiments the data are ac-
quired every 1 to 100 μm with a data collection diame-
ter of about 0.8 to 10 μm preferred. In embodiments
with sufficiently high fluorescence, a CCD (charge
coupled device) detector with broadfield illumination is
utilized. 

By counting the number of photons generated in a 
given area in response to the laser, it is possible to deter- 
mine where fluorescent marked molecules are located 
on the substrate. Consequently, for a slide which has a 
matrix of polypeptides, for example, synthesized on the 
surface thereof, it is possible to determine which of the 
polypeptides is complementary to a fluorescently
marked receptor. 
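
The bookkeeping described above, recording photon counts per stage location and then asking where the signal lies, can be sketched as follows. This is an illustration only: the function names, the regular raster, the fixed threshold, and the sample counts are all assumptions, and no instrument control is shown.

```python
def scan_positions(width_um, height_um, step_um):
    """Yield (x, y) stage positions in micrometers on a regular raster."""
    nx = int(width_um // step_um)
    ny = int(height_um // step_um)
    for j in range(ny):
        for i in range(nx):
            yield (i * step_um, j * step_um)


def labeled_locations(counts_by_position, background, threshold):
    """Return positions whose background-subtracted counts exceed threshold."""
    return sorted(
        pos for pos, counts in counts_by_position.items()
        if counts - background > threshold
    )


# Hypothetical photon counts for a 2 x 2 raster taken at 10 um steps.
positions = list(scan_positions(20.0, 20.0, 10.0))
counts = dict(zip(positions, [120, 5400, 130, 6100]))
hits = labeled_locations(counts, background=125, threshold=1000)
```

Here `hits` holds the two positions with high counts; on a real slide those positions would be matched against the known synthesis map to identify which sequence bound the receptor.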

According to preferred embodiments, the intensity 
and duration of the light applied to the substrate is con- 
trolled by varying the laser power and scan stage rate 
for improved signal-to-noise ratio by maximizing fluo- 
rescence emission and minimizing background noise. 

While the detection apparatus has been illustrated 
primarily herein with regard to the detection of marked 
receptors, the invention will find application in other 
areas. For example, the detection apparatus disclosed 
herein could be used in the fields of catalysis, DNA or 
protein gel scanning, and the like.

VI. Determination of Relative Binding Strength of 
Receptors 

The signal-to-noise ratio of the present invention is 
sufficiently high that not only can the presence or ab- 
sence of a receptor on a ligand be detected, but also the 
relative binding affinity of receptors to a variety of 
sequences can be determined. 

In practice it is found that a receptor will bind to 
several peptide sequences in an array, but will bind 
much more strongly to some sequences than others. 
Strong binding affinity will be evidenced herein by a 
strong fluorescent or radiographic signal since many 
receptor molecules will bind in a region of a strongly 
bound ligand. Conversely, a weak binding affinity will 
be evidenced by a weak fluorescent or radiographic 
signal due to the relatively small number of receptor 
molecules which bind in a particular region of a sub- 
strate having a ligand with a weak binding affinity for 
the receptor. Consequently, it becomes possible to de- 
termine relative binding avidity (or affinity in the case 
of univalent interactions) of a ligand herein by way of 
the intensity of a fluorescent or radiographic signal in a 
region containing that ligand. 
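
One simple way to turn region intensities into a relative ranking is sketched below. The background subtraction and the normalization to the strongest region are assumptions made for illustration, and the counts are invented to show the mechanics; they are not measured values.

```python
def relative_binding(counts_by_ligand, background=0.0):
    """Rank ligands by background-subtracted signal, scaled so the top is 1.0."""
    net = {lig: max(c - background, 0.0)
           for lig, c in counts_by_ligand.items()}
    top = max(net.values())
    if top == 0:
        # No region rises above background; report zero for every ligand.
        return [(lig, 0.0) for lig in sorted(net)]
    ranked = sorted(net.items(), key=lambda kv: kv[1], reverse=True)
    return [(lig, value / top) for lig, value in ranked]


# Hypothetical counts for three regions of an array.
example = {"ligand_1": 150000, "ligand_2": 80000, "ligand_3": 20000}
ranking = relative_binding(example, background=20000)
```

The result orders the ligands from strongest to weakest signal. As the text notes, this reflects relative avidity, or affinity only where the interaction is univalent.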

Semiquantitative data on affinities might also be ob- 
tained by varying washing conditions and concentra- 
tions of the receptor. This would be done by compari- 
son to known ligand receptor pairs, for example. 
VII. Examples

The following examples are provided to illustrate the efficacy of the
inventions herein. All operations were conducted at about ambient
temperatures and pressures unless indicated to the contrary.

A. Slide Preparation

Before attachment of reactive groups it is preferred to clean the
substrate which is, in a preferred embodiment, a glass substrate such
as a microscope slide or cover slip. According to one embodiment the
slide is soaked in an alkaline bath consisting of, for example, 1 liter
of 95% ethanol with 120 ml of water and 120 grams of sodium hydroxide
for 12 hours. The slides are then washed under running water and
allowed to air dry, and rinsed once with a solution of 95% ethanol.

The slides are then aminated with, for example,
aminopropyltriethoxysilane for the purpose of attaching amino groups
to the glass surface on linker molecules, although any omega
functionalized silane could also be used for this purpose. In one
embodiment 0.1% aminopropyltriethoxysilane is utilized, although
solutions with concentrations from [range illegible] may be used, with
about 10⁻³% to 2% preferred. A 0.1% mixture is prepared by adding to
100 ml of a 95% ethanol/5% water mixture, 100 microliters (μl) of
aminopropyltriethoxysilane. The mixture is agitated at about ambient
temperature on a rotary shaker for about 5 minutes. 500 μl of this
mixture is then applied to the surface of one side of each cleaned
slide. After 4 minutes, the slides are decanted of this solution and
rinsed three times by dipping in, for example, 100% ethanol. After the
plates dry, they are placed in a 110°-120° C. vacuum oven for about 20
minutes, and then allowed to cure at room temperature for about 12
hours in an argon environment. The slides are then dipped into DMF
(dimethylformamide) solution, followed by a thorough washing with
methylene chloride.

The aminated surface of the slide is then exposed to about 500 μl of,
for example, a 30 millimolar (mM) solution of NVOC-GABA (gamma amino
butyric acid) NHS (N-hydroxysuccinimide) in DMF for attachment of a
NVOC-GABA to each of the amino groups.

The surface is washed with, for example, DMF, methylene chloride, and
ethanol.

Any unreacted aminopropyl silane on the surface, that is, those amino
groups which have not had the NVOC-GABA attached, are now capped with
acetyl groups (to prevent further reaction) by exposure to a 1:3
mixture of acetic anhydride in pyridine for 1 hour. Other materials
which may perform this residual capping function include
trifluoroacetic anhydride, formicacetic anhydride, or other reactive
acylating agents. Finally, the slides are washed again with DMF,
methylene chloride, and ethanol.

B. Synthesis of Eight Trimers of "A" and "B"

FIG. 10 illustrates a possible synthesis of the eight trimers of the
two-monomer set: gly, phe (represented by "A" and "B," respectively).
A glass slide bearing silane groups terminating in
6-nitroveratryloxycarboxamide (NVOC-NH) residues is prepared as a
substrate. Active esters (pentafluorophenyl, OBt, etc.) of gly and phe
protected at the amino group with NVOC are prepared as reagents. While
not pertinent to this example, if side chain protecting groups are
required for the monomer set, these must not be photoreactive at the
wavelength of light used to protect the primary chain.

For a monomer set of size n, n×l cycles are required to synthesize all
possible sequences of length l. A cycle consists of:

1. Irradiation through an appropriate mask to expose the amino groups
at the sites where the next residue is to be added, with appropriate
washes to remove the by-products of the deprotection.

2. Addition of a single activated and protected (with the same
photochemically-removable group) monomer, which will react only at the
sites addressed in step 1, with appropriate washes to remove the
excess reagent from the surface.

The above cycle is repeated for each member of the monomer set until
each location on the surface has been extended by one residue in one
embodiment. In other embodiments, several residues are sequentially
added at one location before moving on to the next location. Cycle
times will generally be limited by the coupling reaction rate, now as
short as 20 min in automated peptide synthesizers. This step is
optionally followed by addition of a protecting group to stabilize the
array for later testing. For some types of polymers (e.g., peptides),
a final deprotection of the entire surface (removal of photoprotective
side chain groups) may be required.

More particularly, as shown in FIG. 10A, the glass 20 is provided with
regions 22, 24, 26, 28, 30, 32, 34, and 36. Regions 30, 32, 34, and 36
are masked, as shown in FIG. 10B, and the glass is irradiated and
exposed to a reagent containing "A" (e.g., gly), with the resulting
structure shown in FIG. 10C. Thereafter, regions 22, 24, 26, and 28
are masked, the glass is irradiated (as shown in FIG. 10D) and exposed
to a reagent containing "B" (e.g., phe), with the resulting structure
shown in FIG. 10E. The process proceeds, consecutively masking and
exposing the sections as shown until the structure shown in FIG. 10M
is obtained. The glass is irradiated and the terminal groups are,
optionally, capped by acetylation. As shown, all possible trimers of
gly/phe are obtained.

In this example, no side chain protective group removal is necessary.
If it is desired, side chain deprotection may be accomplished by
treatment with ethanedithiol and trifluoroacetic acid.

In general, the number of steps needed to obtain a particular polymer
chain is defined by:

    $n \times l$    (1)

where:
n = the number of monomers in the basis set of monomers, and
l = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of length l will be:

    $n^{l}$    (2)

Of course, greater diversity is obtained by using masking strategies
which will also include the synthesis of polymers having a length of
less than l. If, in the extreme case, all polymers having a length
less than or equal to l are synthesized, the number of polymers
synthesized will be:

    $n^{l} + n^{l-1} + \cdots + n^{1}$    (3)

The maximum number of lithographic steps needed will generally be n
for each "layer" of monomers, i.e., the total number of masks (and,
therefore, the number of lithographic steps) needed will be n×l. The
size of the transparent mask regions will vary in accordance with the
area of the substrate available for synthesis and
the number of sequences to be formed. In general, the size of the
synthesis areas will be:

    size of synthesis areas = (A)/(Sequences)

where:
A is the total area available for synthesis; and
Sequences is the number of sequences desired in the area.

It will be appreciated by those of skill in the art that the above
method could readily be used to simultaneously produce thousands or
millions of oligomers on a substrate using the photolithographic
techniques disclosed herein. Consequently, the method results in the
ability to practically test large numbers of, for example, di, tri,
tetra, penta, hexa, hepta, octapeptides, dodecapeptides, or larger
polypeptides (or correspondingly, polynucleotides).

The above example has illustrated the method by way of a manual
example. It will of course be appreciated that automated or
semi-automated methods could be used. The substrate would be mounted
in a flow cell for automated addition and removal of reagents, to
minimize the volume of reagents needed, and to more carefully control
reaction conditions. Successive masks could be applied manually or
automatically.

C. Synthesis of a Dimer of an Aminopropyl Group and a Fluorescent
Group

In synthesizing the dimer of an aminopropyl group and a fluorescent
group, a functionalized durapore membrane was used as a substrate. The
durapore membrane was a polyvinylidine difluoride with aminopropyl
groups. The aminopropyl groups were protected with the DDZ group by
reaction of the carbonyl chloride with the amino groups, a reaction
readily known to those of skill in the art. The surface bearing these
groups was placed in a solution of THF and contacted with a mask
bearing a checkerboard pattern of 1 mm opaque and transparent regions.
The mask was exposed to ultraviolet light having a wavelength down to
at least about 280 nm for about 5 minutes at ambient temperature,
although a wide range of exposure times and temperatures may be
appropriate in various embodiments of the invention. For example, in
one embodiment, an exposure time of between about 1 and 5000 seconds
may be used at process temperatures of between -70 and [upper limit
illegible].

In one preferred embodiment, exposure times of between about 1 and 500
seconds at about ambient pressure are used. In some preferred
embodiments, pressure above ambient is used to prevent evaporation.

The surface of the membrane was then washed for about 1 hour with a
fluorescent label which included an active ester bound to a chelate of
a lanthanide. Wash times will vary over a wide range of values from
about a few minutes to a few hours. These materials fluoresce in the
red and the green visible region. After the reaction with the active
ester in the fluorophore was complete, the locations in which the
fluorophore was bound could be visualized by exposing them to
ultraviolet light and observing the red and the green fluorescence. It
was observed that the derivatized regions of the substrate closely
corresponded to the original pattern of the mask.

D. Demonstration of Signal Capability

Signal detection capability was demonstrated using a low-level
standard fluorescent bead kit manufactured by Flow Cytometry Standards
and having model no. 824. This kit includes 5.8 μm diameter beads,
each impregnated with a known number of fluorescein molecules.

One of the beads was placed in the illumination field on the scan
stage as shown in FIG. 9 in a field of a laser spot which was
initially shuttered. After being positioned in the illumination field,
the photon detection equipment was turned on. The laser beam was
unblocked and it interacted with the particle bead, which then
fluoresced. Fluorescence curves of beads impregnated with 7,000 and
13,000 fluorescein molecules are shown in FIGS. 11A and 11B
respectively. On each curve, traces for beads without fluorescein
molecules are also shown. These experiments were performed with 488 nm
excitation, with 100 μW of laser power. The light was focused through
a 40 power 0.75 NA objective.

The fluorescence intensity in all cases started off at a high value
and then decreased exponentially. The fall-off in intensity is due to
photobleaching of the fluorescein molecules. The traces of beads
without fluorescein molecules are used for background subtraction. The
difference in the initial exponential decay between labeled and
nonlabeled beads is integrated to give the total number of photon
counts, and this number is related to the number of molecules per
bead. Therefore, it is possible to deduce the number of photons per
fluorescein molecule that can be detected. For the curves illustrated
in FIGS. 11A and 11B, this calculation indicates that about 40 to 50
photons per fluorescein molecule are detected.

E. Determination of the Number of Molecules Per Unit Area

Aminopropylated glass microscope slides prepared according to the
methods discussed above were utilized in order to establish the
density of labeling of the slides. The free amino termini of the
slides were reacted with FITC (fluorescein isothiocyanate) which forms
a covalent linkage with the amino group. The slide is then scanned to
count the number of fluorescent photons generated in a region which,
using the estimated 40-50 photons per fluorescent molecule, enables
the calculation of the number of molecules which are on the surface
per unit area.

A slide with aminopropyl silane on its surface was immersed in a 1 mM
solution of FITC in DMF for 1 hour at about ambient temperature. After
reaction, the slide was washed twice with DMF and then washed with
ethanol, water, and then ethanol again. It was then dried and stored
in the dark until it was ready to be examined.

Through the use of curves similar to those shown in FIGS. 11A and 11B,
and by integrating the fluorescent counts under the exponentially
decaying signal, the number of free amino groups on the surface after
derivatization was determined. It was determined that slides with
labeling densities of 1 fluorescein per 10³×10³ to ~2×2 nm could be
reproducibly made as the concentration of aminopropyltriethoxysilane
varied from 10⁻⁵% to 10⁻¹%.

F. Removal of NVOC and Attachment of a Fluorescent Marker

NVOC-GABA groups were attached as described above. The entire surface
of one slide was exposed to light so as to expose a free amino group
at the end of the gamma amino butyric acid. This slide, and a
duplicate which was not exposed, were then exposed to fluorescein
isothiocyanate (FITC).
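
The calibration arithmetic of sections D and E above reduces to a sum and two divisions, sketched here for reference. The function names and every number below are hypothetical; the counts were chosen only so that the result falls in the 40 to 50 photons per molecule range reported in section D. Traces are assumed to be photon counts per time bin.

```python
def integrated_net_counts(labeled_trace, blank_trace):
    """Sum the background-subtracted counts over all time bins."""
    return sum(a - b for a, b in zip(labeled_trace, blank_trace))


def photons_per_molecule(net_counts, molecules_per_bead):
    """Detected photons attributable to one fluorescein molecule."""
    return net_counts / molecules_per_bead


def molecules_per_unit_area(region_counts, photons_per_mol, area_um2):
    """Surface density implied by the counts collected from a region."""
    return region_counts / photons_per_mol / area_um2


# Hypothetical decaying trace for a labeled bead and a flat blank trace.
net = integrated_net_counts([1000, 500, 250, 125], [100, 100, 100, 100])

# Hypothetical calibration: 315,000 net counts from a 7,000-molecule bead.
ppm = photons_per_molecule(315_000, 7_000)

# Hypothetical slide region: 90,000 net counts from 100 square micrometers.
density = molecules_per_unit_area(90_000, ppm, 100.0)
```

With these invented inputs the calibration gives 45 photons per molecule and a density of 20 molecules per square micrometer. Real values depend on the laser power, the optics, and the degree of photobleaching.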
FIG. 12A illustrates the slide which was not exposed to light, but
which was exposed to FITC. The units of the x axis are time and the
units of the y axis are counts. The trace contains a certain amount of
background fluorescence. The duplicate slide was exposed to 350 nm
broadband illumination for about 1 minute (12 mW/cm², ~350 nm
illumination), washed and reacted with FITC. The fluorescence curves
for this slide are shown in FIG. 12B. A large increase in the level of
fluorescence is observed, which indicates photolysis has exposed a
number of amino groups on the surface of the slides for attachment of
a fluorescent marker.

G. Use of a Mask in Removal of NVOC

The next experiment was performed with a 0.1% aminopropylated slide.
Light from a Hg-Xe arc lamp was imaged onto the substrate through a
laser-ablated chrome-on-glass mask in direct contact with the
substrate.

This slide was illuminated for approximately 5 minutes, with 12 mW of
350 nm broadband light and then reacted with the 1 mM FITC solution.
It was put on the laser detection scanning stage and a graph was
plotted as a two-dimensional representation of position color-coded
for fluorescence intensity. The fluorescence intensity (in counts) as
a function of location is given on the color scale to the right of
FIG. 13A for a mask having 100×100 μm squares.

The experiment was repeated a number of times through various masks.
The fluorescence pattern for a [size illegible] mask is illustrated in
FIG. 13B, for a 20 μm mask in FIG. 13C, and for a 10 μm mask in
FIG. 13D. The mask pattern is distinct down to at least about 10 μm
squares using this lithographic technique.

H. Attachment of YGGFL and Subsequent Exposure to Herz Antibody and
Goat Antimouse

In order to establish that receptors to a particular polypeptide
sequence would bind to a surface-bound peptide and be detected, Leu
enkephalin was coupled to the surface and recognized by an antibody. A
slide was derivatized with 0.1% aminopropyltriethoxysilane and
protected with NVOC. A 500 μm checkerboard mask was used to expose the
slide in a flow cell using backside contact printing. The Leu
enkephalin sequence (H₂N-tyrosine, glycine, glycine, phenylalanine,
leucine-CO₂H, otherwise referred to herein as YGGFL) was attached via
its carboxy end to the exposed amino groups on the surface of the
slide. The peptide was added in DMF solution with the BOP/HOBT/DIEA
coupling reagents and recirculated through the flow cell for 2 hours
at room temperature.

A first antibody, known as the Herz antibody, was applied to the
surface of the slide for 45 minutes at 2 μg/ml in a supercocktail
(containing 1% BSA and 1% ovalbumin also in this case). A second
antibody, goat anti-mouse fluorescein conjugate, was then added at 2
μg/ml in the supercocktail buffer, and allowed to incubate for 2
hours. An image taken at 10 μm steps indicated that not only can
deprotection be carried out in a well defined pattern, but also that
(1) the method provides for successful coupling of peptides to the
surface of the substrate, (2) the surface of a bound peptide is
available for binding with an antibody, and (3) that the detection
apparatus capabilities are sufficient to detect binding of a receptor.

I. Monomer-by-Monomer Formation of YGGFL and Subsequent Exposure to
Labeled Antibody

Monomer-by-monomer synthesis of YGGFL and GGFL in alternate squares
was performed on a slide in a checkerboard pattern and the resulting
slide was exposed to the Herz antibody. This experiment and the
results thereof are illustrated in FIGS. 14A, 14B, 15A, and 15B.

In FIG. 14A, a slide is shown which is derivatized with the
aminopropyl group, protected in this case with t-BOC
(t-butoxycarbonyl). The slide was treated with TFA to remove the t-BOC
protecting group. ε-Aminocaproic acid, which was t-BOC protected at
its amino group, was then coupled onto the aminopropyl groups. The
aminocaproic acid serves as a spacer between the aminopropyl group and
the peptide to be synthesized. The amino end of the spacer was
deprotected and coupled to NVOC-leucine. The entire slide was then
illuminated with 12 mW of 325 nm broadband illumination. The slide was
then coupled with NVOC-phenylalanine and washed. The entire slide was
again illuminated, then coupled to NVOC-glycine and washed. The slide
was again illuminated and coupled to NVOC-glycine to form the sequence
shown in the last portion of FIG. 14A.

As shown in FIG. 14B, alternating regions of the slide were then
illuminated using a projection print using a 500×500 μm checkerboard
mask; thus, the amino group of glycine was exposed only in the lighted
areas. When the next coupling chemistry step was carried out,
NVOC-tyrosine was added, and it coupled only at those spots which had
received illumination. The entire slide was then illuminated to remove
all the NVOC groups, leaving a checkerboard of YGGFL in the lighted
areas and in the other areas, GGFL. The Herz antibody (which
recognizes the YGGFL, but not GGFL) was then added, followed by goat
anti-mouse fluorescein conjugate.

The resulting fluorescence scan is shown in FIG. 15A, and the color
coding for the fluorescence intensity is again given on the right.
Dark areas contain the tetrapeptide GGFL, which is not recognized by
the Herz antibody (and thus there is no binding of the goat anti-mouse
antibody with fluorescein conjugate), and in the red areas YGGFL is
present. The YGGFL pentapeptide is recognized by the Herz antibody
and, therefore, there is antibody in the lighted regions for the
fluorescein-conjugated goat anti-mouse to recognize.

Similar patterns are shown for a 50 μm mask used in direct contact
("proximity print") with the substrate in FIG. 15B. Note that the
pattern is more distinct and the corners of the checkerboard pattern
are touching when the mask is placed in direct contact with the
substrate (which reflects the increase in resolution using this
technique).

J. Monomer-by-Monomer Synthesis of YGGFL and PGGFL

A synthesis using a 50 μm checkerboard mask similar to that shown in
FIG. 15B was conducted. However, P was added to the GGFL sites on the
substrate through an additional coupling step. P was added by exposing
protected GGFL to light and subsequent exposure to P in the manner set
forth above. Therefore, half of the regions on the substrate contained
YGGFL and the remaining half contained PGGFL.

The fluorescence plot for this experiment is provided in FIG. 16. As
shown, the regions are again readily discernable. This experiment
demonstrates that antibodies are able to recognize a specific sequence
and that the recognition is not length-dependent.
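
The checkerboard layouts used in sections I and J amount to assigning one of two sequences to each square according to the parity of its row and column. A minimal sketch follows; the helper names, the grid size, and the choice of which square counts as "lit" are assumptions made for illustration.

```python
def checkerboard(rows, cols, lit="YGGFL", dark="PGGFL"):
    """Return a rows x cols grid alternating the two sequences."""
    return [
        [lit if (r + c) % 2 == 0 else dark for c in range(cols)]
        for r in range(rows)
    ]


def sequence_at(x_um, y_um, square_um=50.0, lit="YGGFL", dark="PGGFL"):
    """Sequence expected at a stage position for a given square size."""
    row = int(y_um // square_um)
    col = int(x_um // square_um)
    return lit if (row + col) % 2 == 0 else dark


grid = checkerboard(2, 4)
```

A scan taken at 10 μm steps, as in section H, would sample each 50 μm square at several positions, and `sequence_at` indicates which sequence each sample should be compared against.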
K. Monomer-by-Monomer Synthesis of YGGFL and YPGGFL

In order to further demonstrate the operability of the invention, a 50
μm checkerboard pattern of alternating YGGFL and YPGGFL was
synthesized on a substrate using techniques like those set forth
above. The resulting fluorescence plot is provided in FIG. 17. Again,
it is seen that the antibody is clearly able to recognize the YGGFL
sequence and does not bind significantly at the YPGGFL regions.

L. Synthesis of an Array of Sixteen Different Amino Acid Sequences and
Estimation of Relative Binding Affinity to Herz Antibody

Using techniques similar to those set forth above, an array of 16
different amino acid sequences (replicated four times) was synthesized
on each of two glass substrates. The sequences were synthesized by
attaching the sequence NVOC-GFL across the entire surface of the
slides. Using a series of masks, two layers of amino acids were then
selectively applied to the substrate. Each region had dimensions of
0.25 cm × 0.0625 cm. The first slide contained amino acid sequences
containing only L amino acids while the second slide contained
selected D amino acids. FIGS. 18A and 18B illustrate a map of the
various regions on the first and second slides, respectively. The
patterns shown in FIGS. 18A and 18B were duplicated four times on each
slide. The slides were then exposed to the Herz antibody and
fluorescein-labeled goat anti-mouse.

FIG. 19 is a fluorescence plot of the first slide, which contained
only L amino acids. Red indicates strong binding (149,000 counts or
more) while black indicates little or no binding of the Herz antibody
(20,000 counts or less). The bottom right-hand portion of the slide
appears "cut off" because the slide was broken during processing. The
sequence YGGFL is clearly most strongly recognized. The sequences
[illegible] and YSGFL also exhibit strong recognition of the antibody.
By contrast, most of the remaining sequences show little or no
binding. The four duplicate portions of the slide are extremely
consistent in the amount of binding shown therein.

FIG. 20 is a fluorescence plot of the second slide. Again, strongest
binding is exhibited by the YGGFL sequence. Significant binding is
also detected to YaGFL, YsGFL, and YpGFL (where L-amino acids are
identified by one upper case letter abbreviation, and D-amino acids
are identified by one lower case letter abbreviation). The remaining
sequences show less binding with the antibody. Note the low binding
efficiency of the sequence yGGFL.

Table 6 lists the various sequences tested in order of relative
fluorescence, which provides information regarding relative binding
affinity.

TABLE 6
Apparent Binding
L-a.a. Set         D-a.a. Set
[entries illegible in source; last D-a.a. entry: waGFL]

VIII. Illustrative Alternative Embodiment

According to an alternative embodiment of the invention, the methods
provide for attaching to the surface a caged binding member which in
its caged form has a relatively low affinity for other potentially
binding species, such as receptors and specific binding substances.

According to this alternative embodiment, the invention provides
methods for forming predefined regions on a surface of a solid
support, wherein the predefined regions are capable of immobilizing
receptors. The methods make use of caged binding members attached to
the surface to enable selective activation of the predefined regions.
The caged binding members are liberated to act as binding members
ultimately capable of binding receptors upon selective activation of
the predefined regions. The activated binding members are then used to
immobilize specific molecules such as receptors on the predefined
region of the surface. The above procedure is repeated at the same or
different sites on the surface so as to provide a surface prepared
with a plurality of regions on the surface containing, for example,
the same or different receptors. When receptors immobilized in this
way have a differential affinity for one or more ligands, screenings
and assays for the ligands can be conducted in the regions of the
surface containing the receptors.

The alternative embodiment may make use of novel caged binding members
attached to the substrate. Caged (unactivated) members have a
relatively low affinity for receptors of substances that specifically
bind to uncaged binding members when compared with the corresponding
affinities of activated binding members. Thus, the binding members are
protected from reaction until a suitable source of energy is applied
to the regions of the surface desired to be activated. Upon
application of a suitable energy source, the caging groups labilize,
thereby presenting the activated binding member. A typical energy
source will be light.

Once the binding members on the surface are activated they may be
attached to a receptor. The receptor chosen may be a monoclonal
antibody, a nucleic acid sequence, a drug receptor, etc. The receptor
will usually, though not always, be prepared so as to permit attaching
it, directly or indirectly, to a binding member. For example, a
specific binding substance having a strong binding affinity for the
binding member and a strong affinity for the receptor or a conjugate
of the receptor may be used to act as a bridge between binding members
and receptors if desired. The method uses a receptor prepared such
that the receptor retains its activity toward a particular ligand.

Preferably, the caged binding member attached to the solid substrate
will be a photoactivatable biotin complex, i.e., a biotin molecule
that has been chemically modified with photoactivatable protecting
groups so that it has a significantly reduced binding affinity for
avidin or avidin analogs than does natural biotin. In a preferred
embodiment, the protecting groups localized in a predefined region of
the surface will be removed upon application of a suitable source of
radiation to give
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binding members, that are biotin or a functionally analo- 
gous compound having substantially the same binding 
affinity for avidin or avidin analogs as does biotin. 

In another preferred embodiment, avidin or an avidin 
analog is incubated with activated binding members on 
the surface until the avidin binds strongly to the binding 
members. The avidin so immobilized on predefined 
regions of the surface can then be incubated with a 
desired receptor or conjugate of a desired receptor. The 
receptor will preferably be biotinylated, e.g., a bi-
otinylated antibody, when avidin is immobilized on the
predefined regions of the surface. Alternatively, a pre- 
ferred embodiment will present an avidin/biotinylated 
receptor complex, which has been previously prepared, 
to activated binding members on the surface. 
IX. Conclusion

The present inventions provide greatly improved 
methods and apparatus for synthesis of polymers on 
substrates. It is to be understood that the above descrip- 
tion is intended to be illustrative and not restrictive. 
Many embodiments will be apparent to those of skill in 
the art upon reviewing the above description. By way 
of example, the invention has been described primarily 
with reference to the use of photoremovable protective 
groups, but it will be readily recognized by those of skill 
in the art that sources of radiation other than light could 
also be used. For example, in some embodiments it may
be desirable to use protective groups which are sensitive
to electron beam irradiation or x-ray irradiation, in
combination with electron beam lithography or x-ray
lithography techniques. Alternatively, the group could
be removed by exposure to an electric current. The 
scope of the invention should, therefore, be determined 
not with reference to the above description, but should
instead be determined with reference to the appended 
claims, along with the full scope of equivalents to which 
such claims are entitled. 

What is claimed is: 

1. A substrate with a surface comprising 10³ or more
groups of oligonucleotides with different, known sequences
covalently attached to the surface in discrete
known regions, said 10³ or more groups of oligonucleotides
occupying a total area of less than 1 cm² on said
substrate, said groups of oligonucleotides having different
nucleotide sequences.

2. The substrate as recited in claim 1 wherein said
substrate comprises 10⁴ or more different groups of
oligonucleotides with known sequences covalently coupled
to discrete known regions of said substrate.
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3. The substrate as recited in claim 1 wherein said 
substrate comprises 10⁵ or more different groups of
oligonucleotides with known sequences in discrete 
known regions. 

4. The substrate as recited in claim 1 wherein said 
substrate comprises 10⁶ or more different groups of
oligonucleotides with known sequences in discrete 
known regions. 

5. The substrate as recited in claim 1 wherein said 
groups of oligonucleotides are at least 50% pure within 
said discrete known regions. 

6. The substrate as recited in claim 1 wherein the 
groups of oligonucleotides are attached to the surface 
by a linker. 

7. An array of more than 1,000 different groups of 
oligonucleotide molecules with known sequences cova- 
lently coupled to a surface of a substrate, said groups of 
oligonucleotide molecules each in discrete known re- 
gions and differing from other groups of oligonucleo- 
tide molecules in monomer sequence, each of said dis- 
crete known regions being an area of less than about 
0.01 cm² and each discrete known region comprising
oligonucleotides of known sequence, said different
groups occupying a total area of less than 1 cm².

8. The array as recited in claim 7 wherein said area is 
less than 10,000 microns².

9. The array as recited in claim 7 made by the process 
of: 

exposing a first region of said substrate to light to 
remove photoremovable groups from nucleic acids 
in said first region, and not exposing a second re- 
gion of said surface to light; 

covalently coupling a first nucleotide to said nucleic 
acids on said part of said substrate exposed to light, 
said first nucleotide covalently coupled to said 
photoremovable group; 

exposing a part of said first region of said substrate to 
light, and not exposing another part of said first 
region of said substrate to light to remove said 
photoremovable groups; 

covalently coupling a second nucleotide to said part 
of said first region exposed to light; and 

repeating said steps of exposing said substrate to light 
and covalently coupling nucleotides until said 
more than 500 different groups of nucleotides are 
formed on said surface. 

10. The array as recited in claim 7 comprising more 
than 10,000 groups of oligonucleotides of known se- 
quences. 
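
As a rough check on the densities recited in claims 1-4 and 7-8, the largest region size compatible with each group count can be worked out directly. The sketch below is illustrative only: it assumes equal-sized square regions that tile the full 1 cm² with no gaps, which the claims do not require, and the function name is hypothetical.

```python
import math


def max_region_side_um(num_groups: int, total_area_cm2: float = 1.0) -> float:
    """Side length, in microns, of equal square regions tiling the total area."""
    # 1 cm = 10,000 microns, so 1 cm^2 = 1e8 square microns.
    area_per_region_um2 = total_area_cm2 * 1e8 / num_groups
    return math.sqrt(area_per_region_um2)


for n in (10**3, 10**4, 10**5, 10**6):
    side = max_region_side_um(n)
    print(f"{n:>9,} groups in 1 cm^2: regions of at most about {side:.0f} microns per side")
```

Under these assumptions, 10⁴ groups correspond to regions of 100 microns per side, which matches the 10,000 microns² area recited in claim 8.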
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ABSTRACT 



Libraries of unimolecular, double-stranded oligonucleotides
on a solid support. These libraries are useful in pharmaceutical
discovery for the screening of numerous biological
samples for specific interactions between the double-stranded
oligonucleotides, and peptides, proteins, drugs and
RNA. In a related aspect, the present invention provides
libraries of conformationally restricted probes on a solid
support. The probes are restricted in their movement and
flexibility using double-stranded oligonucleotides as scaffolding.
The probes are also useful in various screening
procedures associated with drug discovery and diagnosis.
The present invention further provides methods for the 
preparation and screening of the above libraries. 

6 Claims, 1 Drawing Sheet 
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SURFACE-BOUND, UNIMOLECULAR, 
DOUBLE-STRANDED DNA 

GOVERNMENT RIGHTS 

Research leading to the invention was funded in part by
NIH Grant No. R01HG00813-03 and the government may
have certain rights to the invention.

BACKGROUND OF THE INVENTION 

The present invention relates to the field of polymer
synthesis and the use of polymer libraries for biological
screening. More specifically, in one embodiment the invention
provides arrays of diverse double-stranded oligonucleotide
sequences. In another embodiment, the invention provides
arrays of conformationally restricted probes, wherein
the probes are held in position using double-stranded DNA
sequences as scaffolding. Libraries of diverse unimolecular
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In the above-referenced Fodor ct aL. PCT application, an 
elegant method is described for using a computer-controlled
system to direct a VLSIPS™ procedure. Using this
approach, one heterogeneous array of polymers is converted,
through simultaneous coupling at a number of reaction sites,
into a different heterogeneous array. See, U.S. Pat. No.
5,384,261 and U.S. application Ser. No. 07/980,523, the
disclosures of which are incorporated herein for all purposes.

The development of VLSIPS™ technology as described
in the above-noted U.S. Pat. No. 5,143,854 and PCT patent
publication Nos. WO 90/15070 and 92/10092, is considered
pioneering technology in the fields of combinatorial synthesis
and screening of combinatorial libraries. More recently,
patent application Ser. No. 08/082,937, filed Jun. 25, 1993,
now abandoned, describes methods for making arrays of
oligonucleotide probes that can be used to check or determine
a partial or complete sequence of a target nucleic acid
and to detect the presence of a nucleic acid containing a



used, for example, in screening studies for determination of
binding affinity exhibited by binding proteins, drugs, or
RNA.

Methods of synthesizing desired single stranded DNA
sequences are well known to those of skill in the art. In
particular, methods of synthesizing oligonucleotides are
found in, for example, Oligonucleotide Synthesis: A Practical
Approach, Gait, ed., IRL Press, Oxford (1984), incorporated
herein by reference in its entirety for all purposes.
Synthesizing unimolecular double-stranded DNA in solution
has also been described. See, Durand, et al. Nucleic Acids
Res. 18:6353-6359 (1990) and Thomson, et al. Nucleic
Acids Res. 21:5600-5603 (1993), the disclosures of both
being incorporated herein by reference.

Solid phase synthesis of biological polymers has been
evolving since the early "Merrifield" solid phase peptide
synthesis, described in Merrifield, J. Am. Chem. Soc.
85:2149-2154 (1963), incorporated herein by reference for
all purposes.



A number of biochemical processes of pharmaceutical 
interest involve the interaction of some species, e.g., a drug, 
a peptide or protein, or RNA, with double-stranded DNA.
For example, protein/DNA binding interactions are involved
with a number of transcription factors as well as tumor 
suppression associated with the p53 protein and the genes 
contributing to a number of cancer conditions. 

SUMMARY OF THE INVENTION 

High-density arrays of diverse unimolecular, double-stranded
oligonucleotides, as well as arrays of conformationally
restricted probes and methods for their use are
provided by virtue of the present invention. In addition,
methods and devices for detecting duplex formation of
oligonucleotides on an array of diverse single-stranded
oligonucleotides are also provided by this invention. Further,
an adhesive based on the specific binding characteristics
of two arrays of complementary oligonucleotides is
provided by the present invention.

Solid-phase synthesis techniques have been
provided for the synthesis of several peptide sequences on,
for example, a number of "pins." See e.g., Geysen et al., J.
Immun. Meth. 102:259-274 (1987), incorporated herein by
reference for all purposes. Other solid-phase techniques
involve, for example, synthesis of various peptide sequences
on different cellulose disks supported in a column. See Frank
and Doring, Tetrahedron 44:6031-6040 (1988), incorporated
herein by reference for all purposes. Still other solid-phase
techniques are described in U.S. Pat. No. 4,728,502
issued to Hamill and WO 90/00626 (Beattie, inventor).

Each of the above techniques produces only a relatively 
low density array of polymers. For example, the technique 
described in Geysen et al. is limited to producing 96 
different polymers on pins spaced in the dimensions of a 
standard microtiter plate.

Improved methods of forming large arrays of oligonucleotides,
peptides and other polymer sequences in a short
period of time have been devised. Of particular note, Pirrung
et al., U.S. Pat. No. 5,143,854 (see also PCT Application No.
WO 90/15070) and Fodor et al., PCT Publication No. WO
92/10092, all incorporated herein by reference, disclose
methods of forming vast arrays of peptides, oligonucleotides
and other polymer sequences using, for example, light-directed
synthesis techniques. See also, Fodor et al., Science,
251:767-777 (1991), also incorporated herein by reference
for all purposes. These procedures are now referred to as
VLSIPS™ procedures.



According to one aspect of the present invention, libraries
of unimolecular, double-stranded oligonucleotides are provided.
Each member of the library is comprised of a solid
support, an optional spacer for attaching the double-stranded
oligonucleotide to the support and for providing sufficient
space between the double-stranded oligonucleotide and the
solid support for subsequent binding studies and assays, an
oligonucleotide attached to the spacer and further attached to
a second complementary oligonucleotide by means of a
flexible linker, such that the two oligonucleotide portions
exist in a double-stranded configuration. More particularly,
the members of the libraries of the present invention can be
represented by the formula:

Y-L¹-X¹-L²-X²

in which Y is a solid support, L¹ is a bond or a spacer, L² is
a flexible linking group, and X¹ and X² are a pair of
complementary oligonucleotides.
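
The formula above can be pictured as a simple record with one field per component. The following is a minimal sketch, not the patent's method: the class and function names, the default support and linker labels, and the convention that X¹ and X² are both written in the same direction along the chain (so that X² is the reverse complement of X¹) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Watson-Crick partners for DNA bases.
_PARTNER = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_PARTNER[base] for base in reversed(seq.upper()))


@dataclass(frozen=True)
class LibraryMember:
    support: str  # Y: the solid support
    spacer: str   # L1: spacer, or "" when L1 is a bond
    x1: str       # X1: first oligonucleotide
    linker: str   # L2: flexible linking group
    x2: str       # X2: second oligonucleotide

    def is_double_stranded(self) -> bool:
        """True when X2 can fold back and pair fully with X1."""
        return self.x2.upper() == reverse_complement(self.x1)


def make_member(x1: str, support: str = "glass", spacer: str = "PEG",
                linker: str = "PEG") -> LibraryMember:
    """Build a member whose X2 is chosen to be fully complementary to X1."""
    return LibraryMember(support, spacer, x1.upper(), linker, reverse_complement(x1))


member = make_member("GATTACA")
print(member.x2, member.is_double_stranded())
```

A library in this picture is simply a collection of such records that differ in X¹ (and therefore in X²).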

In a specific aspect of the invention, the library of 
different unimolecular, double-stranded oligonucleotides
can be used for screening a sample for a species which binds 
to one or more members of the library. 
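In a related aspect of the invention, a library of different
conformationally restricted probes attached to a solid support
is provided, such that the members each have the
formula: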
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-X ,l -Z-X ,J 

in which X¹¹ and X¹² are complementary oligonucleotides
and Z is a probe having sufficient length such that X¹¹ and
X¹² form a double-stranded oligonucleotide portion of the
member and thereby restrict the conformations available to
the probe. In a specific aspect of the invention, the library of
different conformationally-restricted probes can be used for
screening a sample for a species which binds to one or more
probes in the library.

According to yet another aspect of the present invention,
methods and devices for the bioelectronic detection of
duplex formation are provided.

According to still another aspect of the invention, an 
adhesive is provided which comprises two surfaces of 
complementary oligonucleotides. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A to 1F illustrate the preparation of a member of
a library of surface-bound, unimolecular double-stranded
DNA as well as binding studies with receptors having
specificity for either the double stranded DNA portion, a
probe which is held in a conformationally restricted form by
DNA scaffolding, or a bulge or loop region of RNA.

DESCRIPTION OF THE PREFERRED
EMBODIMENT

Abbreviations

The following abbreviations are used herein: phi, phenanthrenequinone
diimine; phen', 5-amido-glutaric acid-1,10-phenanthroline;
dppz, dipyridophenazine.

Glossary

The following terms are intended to have the following
general meanings as they are used herein:

Chemical terms: As used herein, the term "alkyl" refers to
a saturated hydrocarbon radical which may be straight-chain
or branched-chain (for example, ethyl, isopropyl, t-amyl, or
2,5-dimethylhexyl). When "alkyl" or "alkylene" is used to
refer to a linking group or a spacer, it is taken to be a group
having two available valences for covalent attachment, for
example, -CH₂CH₂-, -CH₂CH₂CH₂-,
-CH₂CH₂CH(CH₃)CH₂- and -CH₂(CH₂CH₂)₂CH₂-.
Preferred alkyl groups as substituents are those containing 1
to 10 carbon atoms, with those containing 1 to 6 carbon
atoms being particularly preferred. Preferred alkyl or alkylene
groups as linking groups are those containing 1 to 20
carbon atoms, with those containing 3 to 6 carbon atoms
being particularly preferred. The term "polyethylene glycol"
is used to refer to those molecules which have repeating
units of ethylene glycol, for example, hexaethylene glycol
(HO-(CH₂CH₂O)₅-CH₂CH₂OH). When the term "polyethylene
glycol" is used to refer to linking groups and spacer
groups, it would be understood by one of skill in the art that
other polyethers or polyols could be used as well (i.e.,
polypropylene glycol or mixtures of ethylene and propylene
glycols).

The term "protecting group" as used herein, refers to any
of the groups which are designed to block one reactive site
in a molecule while a chemical reaction is carried out at
another reactive site. More particularly, the protecting
groups used herein can be any of those groups described in
Greene, et al., Protective Groups In Organic Chemistry, 2nd
Ed., John Wiley & Sons, New York, N.Y., 1991, incorporated
herein by reference. The proper selection of protecting
groups for a particular synthesis will be governed by the
overall methods employed in the synthesis. For example, in
"light-directed" synthesis, discussed below, the protecting
groups will be photolabile protecting groups such as NVOC,
MeNPOC, and those disclosed in co-pending Application
PCT/US93/10162 (filed Oct. 22, 1993), incorporated herein
by reference. In other methods, protecting groups may be
removed by chemical methods and include groups such as
FMOC, DMT and others known to those of skill in the art.

Complementary or substantially complementary: Refers
to the hybridization or base pairing between nucleotides or
nucleic acids, such as, for instance, between the two strands
of a double stranded DNA molecule or between an oligonucleotide
primer and a primer binding site on a single
stranded nucleic acid to be sequenced or amplified. Complementary
nucleotides are, generally, A and T (or A and U), or
C and G. Two single stranded RNA or DNA molecules are
said to be substantially complementary when the nucleotides
of one strand, optimally aligned and compared and with
appropriate nucleotide insertions or deletions, pair with at
least about 80% of the nucleotides of the other strand,
usually at least about 90% to 95%, and more preferably from
about 98 to 100%.

Alternatively, substantial complementarity exists when an
RNA or DNA strand will hybridize under selective hybridization
conditions to its complement. Typically, selective
hybridization will occur when there is at least about 65%
complementarity over a stretch of at least 14 to 25 nucleotides,
preferably at least about 75%, more preferably at
least about 90% complementarity. See, M. Kanehisa, Nucleic
Acids Res. 12:203 (1984), incorporated herein by reference.

Stringent hybridization conditions will typically include
salt concentrations of less than about 1M, more usually less
than about 500 mM and preferably less than about 200 mM.
Hybridization temperatures can be as low as 5° C., but are
typically greater than 22° C., more typically greater than
about 30° C., and preferably in excess of about 37° C.
Longer fragments may require higher hybridization temperatures
for specific hybridization. As other factors may
affect the stringency of hybridization, including base composition
and length of the complementary strands, presence
of organic solvents and extent of base mismatching, the
combination of parameters is more important than the absolute
measure of any one alone.

Epitope: The portion of an antigen molecule which is
delineated by the area of interaction with the subclass of
receptors known as antibodies.

Identifier tag: A means whereby one can identify which
molecules have experienced a particular reaction in the
synthesis of an oligomer. The identifier tag also records the
step in the synthesis series in which the molecules experienced
that particular monomer reaction. The identifier tag
may be any recognizable feature which is, for example:
microscopically distinguishable in shape, size, color, optical
density, etc.; differently absorbing or emitting of light;
chemically reactive; magnetically or electronically encoded;
or in some other way distinctively marked with the required
information. A preferred example of such an identifier tag is
an oligonucleotide sequence.

Ligand/Probe: A ligand is a molecule that is recognized by
a particular receptor. The agent bound by or reacting with a
receptor is called a "ligand," a term which is definitionally
meaningful only in terms of its counterpart receptor. The
term "ligand" does not imply any particular molecular size
or other structural or compositional feature other than that
the substance in question is capable of binding or otherwise
interacting with the receptor. Also, a ligand may serve either
as the natural ligand to which the receptor binds, or as a
functional analogue that may act as an agonist or antagonist.






5,556,752 



Examples of ligands that can be investigated by this invention
include, but are not restricted to, agonists and antagonists
for cell membrane receptors, toxins and venoms, viral
epitopes, hormones (e.g., opiates, steroids, etc.), hormone
receptors, peptides, enzymes, enzyme substrates, substrate
analogs, transition state analogs, cofactors, drugs, proteins,
and antibodies. The term "probe" refers to those molecules
which are expected to act like ligands but for which binding
information is typically unknown. For example, if a receptor
is known to bind a ligand which is a peptide β-turn, a
"probe" or library of probes will be those molecules
designed to mimic the peptide β-turn. In instances where the
particular ligand associated with a given receptor is
unknown, the term probe refers to those molecules designed
as potential ligands for the receptor.

Monomer: Any member of the set of molecules which can
be joined together to form an oligomer or polymer. The set
of monomers useful in the present invention includes, but is
not restricted to, for the example of oligonucleotide synthesis,
the set of nucleotides consisting of adenine, thymine,
cytosine, guanine, and uridine (A, T, C, G, and U, respectively)
and synthetic analogs thereof. As used herein, monomers
refers to any member of a basis set for synthesis of an
oligomer. Different basis sets of monomers may be used at
successive steps in the synthesis of a polymer.
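
The idea of a basis set, and of using different basis sets at successive steps, can be illustrated by enumerating every oligomer that a given series of steps could produce. This is a sketch under stated assumptions: the function name and the example basis sets are hypothetical and are not taken from the specification.

```python
from itertools import product

DNA_BASIS = ("A", "C", "G", "T")


def enumerate_oligomers(basis_sets):
    """Yield every oligomer formed by choosing one monomer from each step's basis set."""
    for combination in product(*basis_sets):
        yield "".join(combination)


# The same four-monomer basis set at each of three steps: 4 * 4 * 4 sequences.
trimers = list(enumerate_oligomers([DNA_BASIS, DNA_BASIS, DNA_BASIS]))
print(len(trimers))

# A restricted basis set at the second step: 4 * 2 * 4 sequences.
restricted = list(enumerate_oligomers([DNA_BASIS, ("A", "G"), DNA_BASIS]))
print(len(restricted))
```

The number of distinct sequences is the product of the basis-set sizes, which is why a modest number of coupling steps can generate a very large library.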

Oligomer or Polymer: The oligomer or polymer
sequences of the present invention are formed from the
chemical or enzymatic addition of monomer subunits. Such
oligomers include, for example, linear, cyclic, and
branched polymers of nucleic acids, polysaccharides, phospholipids,
and peptides having either α-, β-, or ω-amino
acids, heteropolymers in which a known drug is covalently
bound to any of the above, polyurethanes, polyesters, polycarbonates,
polyureas, polyamides, polyethyleneimines,
polyarylene sulfides, polysiloxanes, polyimides, polyacetates,
or other polymers which will be readily apparent to
one skilled in the art upon review of this disclosure. As used
herein, the term oligomer or polymer is meant to include
such molecules as β-turn mimetics, prostaglandins and benzodiazepines
which can also be synthesized in a stepwise
fashion on a solid support.

Peptide: A peptide is an oligomer in which the monomers
are amino acids and which are joined together through
amide bonds and alternatively referred to as a polypeptide.
In the context of this specification it should be appreciated
that when α-amino acids are used, they may be the L-optical
isomer or the D-optical isomer. Other amino acids which are
useful in the present invention include unnatural amino acids
such as β-alanine, phenylglycine, homoarginine and the like.
Peptides are more than two amino acid monomers long, and
often more than 20 amino acid monomers long. Standard
abbreviations for amino acids are used (e.g., P for proline).
These abbreviations are included in Stryer, Biochemistry,
Third Ed., (1988), which is incorporated herein by reference
for all purposes.

Oligonucleotides: An oligonucleotide is a single-stranded
DNA or RNA molecule, typically prepared by synthetic
means. Alternatively, naturally occurring oligonucleotides,
or fragments thereof, may be isolated from their natural
sources or purchased from commercial sources. Those oligonucleotides
employed in the present invention will be 4 to
100 nucleotides in length, preferably from 6 to 30 nucleotides,
although oligonucleotides of different length may be
appropriate. Suitable oligonucleotides may be prepared by
the phosphoramidite method described by Beaucage and
Carruthers, Tetrahedron Lett., 22:1859-1862 (1981), or by
the triester method according to Matteucci, et al., J. Am.
Chem. Soc., 103:3185 (1981), both incorporated herein by
reference, or by other chemical methods using either a
commercial automated oligonucleotide synthesizer or
VLSIPS™ technology (discussed in detail below). When
oligonucleotides are referred to as "double-stranded," it is
understood by those of skill in the art that a pair of
oligonucleotides exist in a hydrogen-bonded, helical array
typically associated with, for example, DNA. In addition to
the 100% complementary form of double-stranded oligonucleotides,
the term "double-stranded" as used herein is
also meant to refer to those forms which include such
structural features as bulges and loops, described more fully
in such biochemistry texts as Stryer, Biochemistry, Third
Ed., (1988), previously incorporated herein by reference for
all purposes.
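
The distinction between the 100% complementary form and less perfectly paired forms can be made concrete with the percentage criterion given in the glossary for "substantially complementary" (at least about 80% of nucleotides paired). The sketch below is a simplification and not the specification's procedure: it assumes two strands of equal length, both written 5' to 3', compared without the insertions, deletions, or optimal alignment that the definition permits, and it does not model bulges or loops. The function names are hypothetical.

```python
# Base pairs named in the glossary: A with T (or A with U), and C with G.
_PAIRS = {("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def percent_complementarity(strand: str, other: str) -> float:
    """Percentage of bases in `strand` that pair with `other`.

    Both strands are written 5'->3', so `other` is read in reverse to
    line the two up antiparallel. Equal lengths and no gaps are assumed.
    """
    if not strand or len(strand) != len(other):
        raise ValueError("strands must be non-empty and of equal length")
    paired = sum(
        (a, b) in _PAIRS for a, b in zip(strand.upper(), reversed(other.upper()))
    )
    return 100.0 * paired / len(strand)


def is_substantially_complementary(strand: str, other: str,
                                   threshold: float = 80.0) -> bool:
    """Apply the 'at least about 80%' criterion as a hard cutoff."""
    return percent_complementarity(strand, other) >= threshold


perfect = percent_complementarity("GATTACAGGC", "GCCTGTAATC")
one_mismatch = percent_complementarity("GATTACAGGC", "GCCTGTAATA")
print(perfect, one_mismatch)
```

Raising `threshold` to 90 or 98 would correspond to the "usually" and "more preferably" ranges in the same definition.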

Receptor: A molecule that has an affinity for a given
ligand or probe. Receptors may be naturally-occurring or
man made molecules. Also, they can be employed in their
unaltered natural or isolated state or as aggregates with other
species. Receptors may be attached, covalently or noncovalently,
to a binding member, either directly or via a
specific binding substance. Examples of receptors which can
be employed by this invention include, but are not restricted
to, antibodies, cell membrane receptors, monoclonal antibodies
and antisera reactive with specific antigenic determinants
(such as on viruses, cells or other materials), drugs,
polynucleotides, nucleic acids, peptides, cofactors, lectins,
sugars, polysaccharides, cells, cellular membranes, and
organelles. Receptors are sometimes referred to in the art as
anti-ligands. As the term receptors is used herein, no difference
in meaning is intended. A "ligand-receptor pair" is
formed when two molecules have combined through
molecular recognition to form a complex. Other examples of
receptors which can be investigated by this invention
include but are not restricted to:

a) Microorganism receptors: Determination of ligands or
probes that bind to receptors, such as specific transport
proteins or enzymes essential to survival of microorganisms,
is useful in developing a new class of antibiotics. Of
particular value would be antibiotics against opportunistic
fungi, protozoa, and those bacteria resistant to the
antibiotics in current use.

b) Enzymes: For instance, the binding site of enzymes 
such as the enzymes responsible for cleaving neu- 
rotransmitters. Determination of ligands or probes that 
bind to certain receptors, and thus modulate the action 
of the enzymes that cleave the different neurotransmit- 
ters, is useful in the development of drugs that can be 
used in the treatment of disorders of neurotransmission. 

c) Antibodies: For instance, the invention may be useful
in investigating the ligand-binding site on the antibody
molecule which combines with the epitope of an antigen
of interest. Determining a sequence that mimics an
antigenic epitope may lead to the development of
vaccines of which the immunogen is based on one or
more of such sequences, or lead to the development of
related diagnostic agents or compounds useful in therapeutic
treatments such as for autoimmune diseases
(e.g., by blocking the binding of the "self" antibodies).

d) Nucleic Acids: The invention may be useful in investigating
sequences of nucleic acids acting as binding
sites for cellular proteins ("trans-acting factors"). Such
sequences may include, e.g., transcription factors, suppressors,
enhancers or promoter sequences.

e) Catalytic Polypeptides: Polymers, preferably polypeptides,
which are capable of promoting a chemical
reaction involving the conversion of one or more
reactants to one or more products. Such polypeptides
generally include a binding site specific for at least one
reactant or reaction intermediate and an active functionality
proximate to the binding site, which functionality
is capable of chemically modifying the bound
reactant. Catalytic polypeptides are described in
Lerner, R. A. et al., Science 252:659 (1991), which is
incorporated herein by reference.
0 Hormone receptors: For instance, the receptors for 
insulin and growth hormone. Determination of the 
ligands which bind with high affinity to a receptor is 
useful in the development of. for example, an oral 
replacement of the daily injections which diabetics 
must take to relieve the symptom* of diabetes, and in 
the other case, a replacement for the scarce human 
growth hormone that can only be obtained from cadav- 
ers or by recombinant DNA technology. Other 
examples arc the vasoconstrictive hormone receptors; 



to 



IS 



In FIG ID, a receptor 6, which can be a protetD. RNA 
molecule or other molecule which is known to bind to DNA, 
is introduced to the library. Dcxermining which member of 
a library binds to the receptor provides information which is 
useful for diagnosing diseases, sequencing DNA or RNA, 
identifying genetic characteristics, or in drug discovery. 

In FIG. IE, the linker 4 is a probe for which binding 
information is sought The probe is held in a conformation- 
ally restricted manner by the flanking oligomers 3 and 5, 
which are present in a double-stranded conformation. As a 
result, a library of conformational ly restricted probes can be 
screened for binding activity with a receptor 7 which has 
specificity for the probe. 

The present invention also contemplates the preparation 
of libraries of uni molecular, double-stranded oligonucle- 
otides having bulges or loops in one of the strands as 
depicted in FIG. IF. In FIG. IF. one oligonucleotide 5 is 
shown as having a bulge 8. Specific RNA bulges arc often 
recognized by proteins (eg.. TAR RNA is recognized by the 



determination of those ligands that bind to a receptor 20 TAT prolc i n 0 f HIV). Accordingly, libraries of RNA bulges 

• . _r j . i J • r \ • * - - **r Jl\ •nflA»t!f> • nntii— att/Mic 



may lead to the development of drugs to control blood 
pressure. 

g) Opiate receptors: Determination of ligands that bind to 
the opiate receptors in the brain is useful in the devel- 
opment of less-addictive replacements for morphine 
and related drugs. 
Substrate or Solid Support: A material having a rigid or 
scmi-rigid surface. Such materials will preferably take the 
form of plates or slides, small beads, pellets, disks or other 
convenient forms, although other forms may be used. In 
some embodiments, at least one surface of the substrate will 
be substantially flat. In other embodiments, a roughly spheri- 
cal shape ts preferred. 

Synthetic: Produced by in vitro chemical or enzymatic 
synthesis. The synthetic libraries of the present invention 
may be contrasted with those in viral or plasmid vectors, for 
instance, which may be propagated in bacterial, yeast, or 
other living hosts. 

DESCRIPTION OF THE INVENTION 

The broad concept of the present invention is illustrated in 
FIGS. 1A lo IF. FIGS. 1A. IB and 1C illustrate the prepa- 
ration or surface-bound unimolccular double stranded DNA. 
while FIGS. ID. IE. and IF illustrate uses for the libraries 
of the present invention. 

FIG. 1A shows a solid support 1 having an attached spacer
2, which is optional. Attached to the distal end of the spacer
is a first oligomer 3, which can be attached as a single unit
or synthesized on the support or spacer in a monomer by
monomer approach. FIG. 1B shows a subsequent stage in
the preparation of one member of a library according to the
present invention. In this stage, a flexible linker 4 is attached
to the distal end of the oligomer 3. In other embodiments, the
flexible linker will be a probe. FIG. 1C shows the completed
surface-bound unimolecular double-stranded DNA which is
one member of a library, wherein a second oligomer 5 is now
attached to the distal end of the flexible linker (or probe). As
shown in FIG. 1C, the length of the flexible linker (or probe)
4 is sufficient such that the first and second oligomers (which
are complementary) exist in a double-stranded conformation.
It will be appreciated by one of skill in the art that the
libraries of the present invention will contain multiple,
individually synthesized members which can be screened for
various types of activity. Three such binding events are
illustrated in FIGS. 1D, 1E and 1F.

or loops are useful in a number of diagnostic applications.
One of skill in the art will appreciate that the bulge or loop
can be present in either oligonucleotide portion 3 or 5.

Libraries of Unimolecular, Double-Stranded Oligonucleotides

In one aspect, the present invention provides libraries of
unimolecular double-stranded oligonucleotides, each member
of the library having the formula:

Y-L¹-X¹-L²-X²

in which Y represents a solid support, X¹ and X² represent
a pair of complementary oligonucleotides, L¹ represents a
bond or a spacer, and L² represents a linking group having
sufficient length such that X¹ and X² form a double-stranded
oligonucleotide.

The solid support may be biological, nonbiological,
organic, inorganic, or a combination of any of these, existing
as particles, strands, precipitates, gels, sheets, tubing,
spheres, containers, capillaries, pads, slices, films, plates,
slides, etc. The solid support is preferably flat but may take
on alternative surface configurations. For example, the solid
support may contain raised or depressed regions on which
synthesis takes place. In some embodiments, the solid
support will be chosen to provide appropriate light-absorbing
characteristics. For example, the support may be a
polymerized Langmuir Blodgett film, functionalized glass,
Si, Ge, GaAs, GaP, SiO₂, SiN₄, modified silicon, or any one
of a variety of gels or polymers such as (poly)tetrafluoroethylene,
(poly)vinylidenedifluoride, polystyrene, polycarbonate,
or combinations thereof. Other suitable solid support
materials will be readily apparent to those of skill in the art.
Preferably, the surface of the solid support will contain
reactive groups, which could be carboxyl, amino, hydroxyl,
thiol, or the like. More preferably, the surface will be
optically transparent and will have surface Si-OH functionalities,
such as are found on silica surfaces.

Attached to the solid support is an optional spacer, L¹. The
spacer molecules are preferably of sufficient length to permit
the double-stranded oligonucleotides in the completed member
of the library to interact freely with molecules exposed
to the library. The spacer molecules, when present, are
typically 6-50 atoms long to provide sufficient exposure for
the attached double-stranded DNA molecule. The spacer, L¹,
is comprised of a surface attaching portion and a longer
chain portion. The surface attaching portion is that part of L¹
which is directly attached to the solid support. This portion
can be attached to the solid support via carbon-carbon bonds
using, for example, supports having (poly)trifluorochloroethylene
surfaces, or preferably, by siloxane bonds (using,
for example, glass or silicon oxide as the solid support).
Siloxane bonds with the surface of the support are formed in
one embodiment via reactions of surface attaching portions
bearing trichlorosilyl or trialkoxysilyl groups. The surface
attaching groups will also have a site for attachment of the
longer chain portion. For example, groups which are suitable
for attachment to a longer chain portion would include
amines, hydroxyl, thiol, and carboxyl. Preferred surface
attaching portions include aminoalkylsilanes and
hydroxyalkylsilanes. In particularly preferred embodiments, the
surface attaching portion of L¹ is either
bis(2-hydroxyethyl)-aminopropyltriethoxysilane,
2-hydroxyethylaminopropyltriethoxysilane,
aminopropyltriethoxysilane or hydroxypropyltriethoxysilane.

The longer chain portion can be any of a variety of
molecules which are inert to the subsequent conditions for
polymer synthesis. These longer chain portions will typically
be aryl acetylene, ethylene glycol oligomers containing
2-14 monomer units, diamines, diacids, amino acids, peptides,
or combinations thereof. In some embodiments, the
longer chain portion is a polynucleotide. The longer chain
portion which is to be used as part of L¹ can be selected
based upon its hydrophilic/hydrophobic properties to
improve presentation of the double-stranded oligonucleotides
to certain receptors, proteins or drugs. The longer
chain portion of L¹ can be constructed of polyethyleneglycols,
polynucleotides, alkylene, polyalcohol, polyester,
polyamine, polyphosphodiester and combinations thereof.
Additionally, for use in synthesis of the libraries of the
invention, L¹ will typically have a protecting group, attached
to a functional group (i.e., hydroxyl, amino or carboxylic
acid) on the distal or terminal end of the chain portion
(opposite the solid support). After deprotection and coupling,
the distal end is covalently bound to an oligomer.

Attached to the distal end of L¹ is an oligonucleotide, X¹,
which is a single-stranded DNA or RNA molecule. The
oligonucleotides which are part of the present invention are
typically of from about 4 to about 100 nucleotides in length.
Preferably, X¹ is an oligonucleotide which is about 6 to
about 30 nucleotides in length. The oligonucleotide is typically
linked to L¹ via the 3'-hydroxyl group of the oligonucleotide
and a functional group on L¹ which results in the
formation of an ether, ester, carbamate or phosphate ester
linkage.

Attached to the distal end of X¹ is a linking group, L²,
which is flexible and of sufficient length that X¹ can effectively
hybridize with X². The length of the linker will
typically be a length which is at least the length spanned by
two nucleotide monomers, and preferably at least four
nucleotide monomers, while not being so long as to interfere
with either the pairing of X¹ and X² or any subsequent
assays. The linking group itself will typically be an alkylene
group (of from about 6 to about 24 carbons in length), a
polyethyleneglycol group (of from about 2 to about 24
ethyleneglycol monomers in a linear configuration), a
polyalcohol group, a polyamine group (e.g., spermine, spermidine
and polymeric derivatives thereof), a polyester group
(e.g., poly(ethyl acrylate) having of from 3 to 15 ethyl
acrylate monomers in a linear configuration), a
polyphosphodiester group, or a polynucleotide (having from about 2
to about 12 nucleic acids). Preferably, the linking group will
be [illegible] from about 1 to 4 hexaethyleneglycols
[illegible] linear array. For use in synthesis
of the compounds of the invention, the linking group will be
provided with functional groups which can be suitably
protected or activated. The linking group will be covalently
attached to each of the complementary oligonucleotides, X¹
and X², by means of an ether, ester, carbamate, phosphate
ester or amine linkage. The flexible linking group L² will be
attached to the 5'-hydroxyl of the terminal monomer of X¹
and to the 3'-hydroxyl of the initial monomer of X². Preferred
linkages are phosphate ester linkages which can be
formed in the same manner as the oligonucleotide linkages
which are present in X¹ and X². For example,
hexaethyleneglycol can be protected on one terminus with a
photolabile protecting group (i.e., NVOC or MeNPOC) and
activated on the other terminus with
2-cyanoethyl-N,N-diisopropylamino-chlorophosphite to form a
phosphoramidite. This linking group can then be used for
construction of the libraries in the same manner as the
photolabile-protected, phosphoramidite-activated nucleotides.
Alternatively, ester linkages to X¹ and X² can be formed when
the L² has terminal carboxylic acid moieties (using the
5'-hydroxyl of X¹ and the 3'-hydroxyl of X²). Other methods of
forming ether, carbamate or amine linkages are known to those of
skill in the art and particular reagents and references can be
found in such texts as March, Advanced Organic Chemistry,
4th Ed., Wiley-Interscience, New York, N.Y., 1992, incorporated
herein by reference.

The oligonucleotide, X², which is covalently attached to
the distal end of the linking group is, like X¹, a
single-stranded DNA or RNA molecule. The oligonucleotides
which are part of the present invention are typically of from
about 4 to about 100 nucleotides in length. Preferably, X² is
an oligonucleotide which is about 6 to about 30 nucleotides
in length and exhibits complementarity to X¹ of from 90 to
100%. More preferably, X¹ and X² are 100% complementary.
In one group of embodiments, either X¹ or X² will
further comprise a bulge or loop portion and exhibit
complementarity of from 90 to 100% over the remainder of the
oligonucleotide.

In a particularly preferred embodiment, the solid support
is a silica support, the spacer is a polyethyleneglycol
conjugated to an aminoalkylsilane, the linking group is a
polyethyleneglycol group, and X¹ and X² are complementary
oligonucleotides each comprising of from 6 to 30
nucleic acid monomers.

The library can have virtually any number of different
members, and will be limited only by the number or variety
of compounds desired to be screened in a given application
and by the synthetic capabilities of the practitioner. In one
group of embodiments, the library will have from 2 up to
100 members. In other groups of embodiments, the library
will have between 100 and 10000 members, and between
10000 and 1000000 members, preferably on a solid support.
In preferred embodiments, the library will have a density of
more than 100 members at known locations per cm²,
preferably more than 1000 per cm², more preferably more than
10,000 per cm².

Libraries of Conformationally Restricted Probes

In still another aspect, the present invention provides
libraries of conformationally-restricted probes. Each of the
members of the library comprises a solid support having an
optional spacer which is attached to an oligomer of the
formula:

-X¹¹-Z-X¹²

in which X¹¹ and X¹² are complementary oligonucleotides
and Z is a probe. The probe will have sufficient length such
that X¹¹ and X¹² form a double-stranded DNA portion of
each member. X¹¹ and X¹² are as described above for X¹ and
X², respectively, except that for the present aspect of the
invention, each member of the probe library can have the
same X¹¹ and the same X¹², and differ only in the probe
portion. In one group of embodiments, X¹¹ and X¹² are
either a poly-A oligonucleotide or a poly-T oligonucleotide.

As noted above, each member of the library will typically
have a different probe portion. The probes, Z, can be any of
a variety of structures for which receptor-probe binding
information is sought for conformationally-restricted forms.
For example, the probe can be an agonist or antagonist for
a cell membrane receptor, a toxin, venom, viral epitope,
hormone, peptide, enzyme, cofactor, drug, protein or antibody.
In one group of embodiments, the probes are different
peptides, each having of from about 4 to about 12 amino
acids. Preferably the probes will be linked via polyphosphate
diesters, although other linkages are also suitable. For
example, the last monomer employed on the X¹¹ chain can
be a 5'-aminopropyl-functionalized phosphoramidite nucleotide
(available from Glen Research, Sterling, Va., USA or
Genosys Biotechnologies, The Woodlands, Tex., USA)
which will provide a synthesis initiation site for the carboxy
to amino synthesis of the peptide probe. Once the peptide
probe is formed, a 3'-succinylated nucleoside (from Cruachem,
Sterling, Va., USA) will be added under peptide
coupling conditions. In yet another group of embodiments,
the probes will be oligonucleotides of from 4 to about 30
nucleic acid monomers which will form a DNA or RNA
hairpin structure. For use in synthesis, the probes can also
[illegible] acid, anhydride or derivative thereof) for
[illegible] two positions on the probe to each of the
complementary oligonucleotides.

The surface of the solid support is preferably provided
with a spacer molecule, although it will be understood that
the spacer molecules are not elements of this aspect of the
invention. Where present, the spacer molecules will be as
described above for L¹.

The libraries of conformationally restricted probes can
also have virtually any number of members. As above, the
number of members will be limited only by design of the
particular screening assay for which the library will be used,
and by the synthetic capabilities of the practitioner. In one
group of embodiments, the library will have from 2 to 100
members. In other groups of embodiments, the library will
have between 100 and 10000 members, and between 10000
and 1000000 members. Also as above, in preferred embodiments,
the library will have a density of more than 100
members at known locations per cm², preferably more than
1000 per cm², more preferably more than 10,000 per cm².

Preparation of the Libraries

The present invention further provides methods for the
preparation of diverse unimolecular, double-stranded
oligonucleotides on a solid support. In one group of embodiments,
the surface of a solid support has a plurality of
preselected regions. An oligonucleotide of from 6 to 30
monomers is formed on each of the preselected regions. A
linking group is then attached to the distal end of each of the
oligonucleotides. Finally, a second oligonucleotide is
formed on the distal end of each linking group such that the
second oligonucleotide is complementary to the oligonucleotide
already present in the same preselected region. The
linking group used will have sufficient length such that the
complementary oligonucleotides form a unimolecular,
double-stranded oligonucleotide. In another group of
embodiments, each chemically distinct member of the
library will be synthesized on a separate solid support.

Libraries on a Single Substrate
Light-Directed Methods

For those embodiments using a single solid support, the
oligonucleotides of the present invention can be formed
using a variety of techniques known to those skilled in the
art of polymer synthesis on solid supports. For example,
"light directed" methods (which are one technique in a
family of methods known as VLSIPS™ methods) are
described in U.S. Pat. No. 5,143,854, previously incorporated
by reference. The light directed methods discussed in
the '854 patent involve activating predefined regions of a
substrate or solid support and then contacting the substrate
with a preselected monomer solution. The predefined
regions can be activated with a light source, typically shown
through a mask (much in the manner of photolithography
techniques used in integrated circuit fabrication). Other
regions of the substrate remain inactive because they are
blocked by the mask from illumination and remain chemically
protected. Thus, a light pattern defines which regions
of the substrate react with a given monomer. By repeatedly
activating different sets of predefined regions and contacting
different monomer solutions with the substrate, a diverse
array of polymers is produced on the substrate. Of course,
other steps such as washing unreacted monomer solution
from the substrate can be used as necessary. Other techniques
include mechanical techniques such as those
described in PCT No. 92/10183, U.S. Pat. No. 5,384,261
also incorporated herein by reference for all purposes. Still
further techniques include bead based techniques such as
[illegible] herein by reference, and pin based methods such as
those described in U.S. Pat. No. 5,288,514, also incorporated
herein by reference.

The VLSIPS™ methods are preferred for making the
compounds and libraries of the present invention. The
surface of a solid support, optionally modified with spacers
having photolabile protecting groups such as NVOC and
MeNPOC, is illuminated through a photolithographic mask,
yielding reactive groups (typically hydroxyl groups) in the
illuminated regions. A 3'-O-phosphoramidite activated
deoxynucleoside (protected at the 5'-hydroxyl with a
photolabile protecting group) is then presented to the surface
and chemical coupling occurs at sites that were exposed to
light. Following capping, and oxidation, the substrate is
rinsed and the surface illuminated through a second mask, to
expose additional hydroxyl groups for coupling. A second
5'-protected, 3'-O-phosphoramidite activated deoxynucleoside
is presented to the surface. The selective photodeprotection
and coupling cycles are repeated until the desired set
of oligonucleotides is produced. Alternatively, an oligomer
of from, for example, 4 to 30 nucleotides can be added to
each of the preselected regions rather than synthesize each
member in a monomer by monomer approach. At this point
in the synthesis, either a flexible linking group or a probe can
be attached in a similar manner. For example, a flexible
linking group such as polyethylene glycol will typically
have an activating group (i.e., a phosphoramidite) on one
end and a photolabile protecting group attached to the other
end. Suitably derivatized polyethylene glycol linking groups
can be prepared by the methods described in Durand, et al.,
Nucleic Acids Res. 18:6353-6359 (1990). Briefly, a polyethylene
glycol (i.e., hexaethylene glycol) can be mono-protected
using MeNPOC-chloride. Following purification
of the mono-protected glycol, the remaining hydroxyl
can be activated with
2-cyanoethyl-N,N-diisopropylaminochlorophosphite. Once the
flexible linking group has been
attached to the first oligonucleotide (X¹), deprotection and
coupling cycles will proceed using 5'-protected,
3'-O-phosphoramidite activated deoxynucleosides or intact oligomers.
Probes can be attached in a manner similar to that used for
the flexible linking group. When the desired probe is itself
an oligomer, it can be formed either in stepwise fashion on
the immobilized oligonucleotide or it can be separately
synthesized and coupled to the immobilized oligomer in a
single step. For example, preparation of conformationally
restricted β-turn mimetics will typically involve synthesis of
an oligonucleotide as described above, in which the last
nucleoside monomer will be derivatized with an
aminoalkyl-functionalized phosphoramidite. See, U.S. Pat. No.
5,288,514, previously incorporated by reference. The desired
peptide probe is typically formed in the direction from
carboxyl to amine terminus. Subsequent coupling of a
3'-succinylated nucleoside, for example, provides the first
monomer in the construction of the complementary
oligonucleotide strand (which is carried out by the above
methods). Alternatively, a library of probes can be prepared by
first derivatizing a solid support with multiple poly(A) or
poly(T) oligonucleotides which are suitably protected with
photolabile protecting groups, deprotecting at known sites
and constructing the probe at those sites, then coupling the
complementary poly(T) or poly(A) oligonucleotide.
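
The mask-and-couple cycle of the light-directed methods described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function name, the region indices, and the four-cycle mask schedule are hypothetical and are not taken from the '854 patent. Each cycle illuminates a set of regions and couples one monomer there, and each string records monomers in the order they were coupled.

```python
def light_directed_synthesis(n_regions, cycles):
    """Simulate mask-based coupling cycles on a single substrate.

    cycles is a list of (mask, monomer) pairs. Each mask is the set of
    region indices illuminated (and therefore deprotected) in that cycle.
    Regions outside the mask stay protected and do not react.
    """
    strands = ["" for _ in range(n_regions)]
    for mask, monomer in cycles:
        for region in sorted(mask):
            if not 0 <= region < n_regions:
                raise ValueError(f"region {region} is not on the substrate")
            strands[region] += monomer  # coupling occurs only where illuminated
    return strands


# Hypothetical schedule: four cycles build four distinct dimers.
cycles = [({0, 1}, "A"), ({2, 3}, "C"), ({0, 2}, "G"), ({1, 3}, "T")]
print(light_directed_synthesis(4, cycles))  # ['AG', 'AT', 'CG', 'CT']
```

In this toy schedule, four coupling cycles yield four different sequences at known locations, which is the sense in which a light pattern defines which regions react with a given monomer.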

Flow Channel or Spotting Methods

Additional methods applicable to library synthesis on a
single substrate are described in co-pending applications
Ser. No. 07/980,523, filed Nov. 20, 1992, and U.S. Pat. No.
5,384,261, incorporated herein by reference for all purposes.
In the methods disclosed in these applications, reagents are
delivered to the substrate by either (1) flowing within a
channel defined on predefined regions or (2) "spotting" on
predefined regions. However, other approaches, as well as
combinations of spotting and flowing, may be employed. In
each instance, certain activated regions of the substrate are
mechanically separated from other regions when the monomer
solutions are delivered to the various reaction sites.

A typical "flow channel" method applied to the compounds
and libraries of the present invention can generally
be described as follows. Diverse polymer sequences are
synthesized at selected regions of a substrate or solid support
by forming flow channels on a surface of the substrate
through which appropriate reagents flow or in which appropriate
reagents are placed. For example, assume a monomer
"A" is to be bound to the substrate in a first group of selected
regions. If necessary, all or part of the surface of the
substrate in all or a part of the selected regions is activated
for binding by, for example, flowing appropriate reagents
through all or some of the channels, or by washing the entire
substrate with appropriate reagents. After placement of a
channel block on the surface of the substrate, a reagent
having the monomer A flows through or is placed in all or
some of the channel(s). The channels provide fluid contact
to the first selected regions, thereby binding the monomer A
on the substrate directly or indirectly (via a spacer) in the
first selected regions.

Thereafter, a monomer B is coupled to second selected
regions, some of which may be included among the first
selected regions. The second selected regions will be in fluid
contact with a second flow channel(s) through translation,
rotation, or replacement of the channel block on the surface
of the substrate; through opening or closing a selected valve;
or through deposition of a layer of chemical or photoresist.
If necessary, a step is performed for activating at least the
second regions. Thereafter, the monomer B is flowed
through or placed in the second flow channel(s), binding
monomer B at the second selected locations. In this particular
example, the resulting sequences bound to the substrate
at this stage of processing will be, for example, A, B, and
AB. The process is repeated to form a vast array of
sequences of desired length at known locations on the
substrate.
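
The A, B, and AB outcome described above can be reproduced with a small model of a channel block that is rotated between steps. The grid layout, the function name, and the convention that `None` marks an unused channel are assumptions made for illustration only. In the first step the channels run along rows, and after rotation they run along columns.

```python
def run_channels(grid, axis, monomers):
    """Flow one monomer through each channel and couple it to every cell
    the channel touches.

    axis is "row" or "col": the direction in which the channels run.
    monomers[i] is the monomer in channel i, or None if that channel
    carries no monomer in this step.
    """
    if axis not in ("row", "col"):
        raise ValueError("axis must be 'row' or 'col'")
    n_rows, n_cols = len(grid), len(grid[0])
    for i, monomer in enumerate(monomers):
        if monomer is None:
            continue
        if axis == "row":
            cells = [(i, c) for c in range(n_cols)]
        else:
            cells = [(r, i) for r in range(n_rows)]
        for r, c in cells:
            grid[r][c] += monomer
    return grid


grid = [["", ""], ["", ""]]
run_channels(grid, "row", ["A", None])  # first selected regions: row 0
run_channels(grid, "col", [None, "B"])  # block rotated; second regions: column 1
print(grid)  # [['A', 'AB'], ['', 'B']]
```

The region where the two channel sets overlap carries AB, the regions touched by only one set carry A or B, and the untouched region stays empty.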

After the substrate is activated, monomer A can be flowed
through some of the channels, monomer B can be flowed
through other channels, a monomer C can be flowed through
still other channels, etc. In this manner, many or all of the
reaction regions are reacted with a monomer before the
channel block must be moved or the substrate must be
washed and/or reactivated. By making use of many or all of
the available reaction regions simultaneously, the number of
washing and activation steps can be minimized.

One of skill in the art will recognize that there are
alternative methods of forming channels or otherwise protecting
a portion of the surface of the substrate. For example,
according to some embodiments, a protective coating such
as a hydrophilic or hydrophobic coating (depending upon
the nature of the solvent) is utilized over portions of the
substrate to be protected, sometimes in combination with
materials that facilitate wetting by the reactant solution in
other regions. In this manner, the flowing solutions are
further prevented from passing outside of their designated
flow paths.

The "spotting" methods of preparing compounds and
libraries of the present invention can be implemented in
much the same manner as the flow channel methods. For
example, a monomer A can be delivered to and coupled with
a first group of reaction regions which have been appropriately
activated. Thereafter, a monomer B can be delivered to
and reacted with a second group of activated reaction
regions. Unlike the flow channel embodiments described
above, reactants are delivered by directly depositing (rather
than flowing) relatively small quantities of them in selected
regions. In some steps, of course, the entire substrate surface
can be sprayed or otherwise coated with a solution. In
preferred embodiments, a dispenser moves from region to
region, depositing only as much monomer as necessary at
each stop. Typical dispensers include a micropipette to
deliver the monomer solution to the substrate and a robotic
system to control the position of the micropipette with
respect to the substrate, or an ink-jet printer. In other
embodiments, the dispenser includes a series of tubes, a
manifold, an array of pipettes, or the like so that various
reagents can be delivered to the reaction regions
simultaneously.
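
One way to picture the region-to-region movement of a dispenser is as a deposit plan derived from the sequences wanted at each region. The sketch below is illustrative only: the function name, the use of (row, column) tuples as region identifiers, and the target sequences are hypothetical. Each step lists the stops the dispenser makes and the monomer deposited at each stop.

```python
def spotting_plan(targets):
    """Build a per-step deposit plan for a spotting dispenser.

    targets maps a region identifier to the sequence wanted there,
    written in coupling order. The result is a list of steps; each step
    is a list of (region, monomer) deposits, visited in sorted order.
    Regions whose sequence is already complete are skipped.
    """
    n_steps = max((len(seq) for seq in targets.values()), default=0)
    plan = []
    for step in range(n_steps):
        deposits = [
            (region, seq[step])
            for region, seq in sorted(targets.items())
            if step < len(seq)
        ]
        plan.append(deposits)
    return plan


targets = {(0, 0): "AC", (0, 1): "G", (1, 0): "TTA"}
for step, deposits in enumerate(spotting_plan(targets), start=1):
    print(step, deposits)
# 1 [((0, 0), 'A'), ((0, 1), 'G'), ((1, 0), 'T')]
# 2 [((0, 0), 'C'), ((1, 0), 'T')]
# 3 [((1, 0), 'A')]
```

A dispenser with several tubes or pipettes could execute all deposits of one step at once, while a single micropipette would visit the stops in turn.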

Pin-Based Methods

Another method which is useful for the preparation of
compounds and libraries of the present invention involves
"pin based synthesis." This method is described in detail in
U.S. Pat. No. 5,288,514, previously incorporated herein by
reference. The method utilizes a substrate having a plurality
of pins or other extensions. The pins are each inserted
simultaneously into individual reagent containers in a tray.
In a common embodiment, an array of 96 pins/containers is
utilized.

Each tray is filled with a particular reagent for coupling in
a particular chemical reaction on an individual pin. Accordingly,
the trays will often contain different reagents. Since
the chemistry disclosed herein has been established such that
a relatively similar set of reaction conditions may be utilized
to perform each of the reactions, it becomes possible to
conduct multiple chemical coupling steps simultaneously. In
the first step of the process the invention provides for the use
of substrate(s) on which the chemical coupling steps are
conducted. The substrate is optionally provided with a
spacer having active sites. In the particular case of
oligonucleotides, for example, the spacer may be selected from a
wide variety of molecules which can be used in organic 
environments associated with synthesis as well as aqueous 
environments associated with binding studies. Examples of 
suitable spacers arc polyethylcncglyools, dicarboxylic acids, 
polyamines and alkylenes, substituted with, for example, 
tnethoxy and cthoxy groups. Additionally, the spacers will 
have an active site on the distal end. The active sites are 
optionally protected initially by protecting groups. Among a 
wide variety of protecting groups which are useful are 
FMOC BOC t-butyl esters, i-butyl ethers, and the like. 
Various exemplary protecting groups are described in, for 
example, Atherton el al.. Solid Phase Ptptid* Sptihesis t 1RL 
Press (1989), incorporated herein by reference. In some 
embodiments, the spacer may provide for a cleavable func- 
tion by way of, for example, exposure to acid or base. 
Libraries on Multiple Substrates 
Bead Based Methods 



probe which is present on each bead. A complete description 
of identifier tags for use in synthetic libraries is provided in 
co-pending application Sex.. No. 08/146,886 (filed Nov. 2, 
1993) previously incorporated by reference for all purposes, 
s Methods of library Screening 

A library prepared according to any of the methods 
described above can be used to screen for receptors having 
■ high affinity for cither unirnolccular, double-stranded oligo- 
nucleotides or conformational I y restricted probes. In one 
10 group of embodiments, a solution containing a marked 
(labelled) receptor is introduced to the library and incubated 
for a suitable period of time. The library is then washed free 
of unbound receptor and the probes or doable-stranded 
oligonucleotides having high affinity for the receptor arc 
15 identified by identifying those regions on the surface of the 
library where markers arc located. Suitable markers include, 
but are not limited to, radiolabels. chromophores, fiuoro- 
pnores, chemiluminescent moieties, and transition metals. 
Alternatively, the presence of receptors may be detected 



Ycl another method which is useful for synthesis of 20 using a variety of other techniques, such as an assay with a 



compounds and libraries of the present invention involves 
••bead based synthesis." A general approach for bead based 
synthesis is described copending application Scr. Nos. 
07/762^22 (filed Sep. 18, 1991 now abandoned); 07/946, 
239 (filed Sep. 16. 1992); 08/146,886 (filed Nov. 2, 1993); 25 
07/876.792 (filed Apr. 29. 1992) and PCTAJS93/04145 
(filed Apr. 28. 1993). the disclosures of which arc incorpo- 
rated herein by reference. 

For the synthesis of molecule* such as oligonucleotides 
on beads, a large plurality of beads arc suspended in a 30 
suitable carrier (such as water) in a container. The beads arc 
provided with optional spacer molecules having an active 
site. The active site is protected by an optional protecting 
group. 

In a first step of the synthesis, the beads are divided for 35 
coupling into a plurality of containers. For the purposes of 
this brief description, the number of containers will be 
limited to three, and the monomers denoted as A, B, C. D, 
E. and F. The protecting groups arc then removed and a first 



labelled enzyme, antibody, and the like. Other techniques 
using various marker systems for detecting bound receptor 
will be readily apparent to those skilled in the art 

In a preferred embodiment, a library prepared on a single 
solid support (using, for example, the VLSIFS™ technique) 
can be exposed to a solution containing marked receptor 
such as a marked antibody. The receptor can be marked in 
any of a variety of ways, but in one embodiment marking is 
effected with a radioactive label.Thc marked anubody birds 
with high affinity to an immobilized antigen previously, 
localized on thesurfacc. After washing the surface free of 
unbound receptor, the surface is placed proximate to x-ray. 
film or phosphorimagcrs to identify the antigens that arc 
recognized by the antibody. Alternatively, a fluorescent 
marker may be provided and detection may be by way of a 
charge-coupled device (CCO). fluorescence microscopy or 
laser scanning. 

When autoradiography is the detection method used, the 
marker is a radioactive label, such as 3a R The marker on the 



portion of the molecule to be synthesized is added to each of 40 surface is exposed to X-ray film or a phosphorimagcr, which 

is developed and read out on a. scanner An exposure time of 
about 1 hour is typical in one embodiment. Fluorescence 
detection using a fiuorophorc label, such as fluorescein, 
attached to the receptor will usually require shoncr exposure 
45 times. 

Quantitative assays for receptor concentrations can also 
be performed according to the present invention. In a direct 
assay method, the surface containing localized probes pre- 
pared as described above, is incubated with a solution 
50 containing a marked receptor for a suitable period of time. 
The surface is then washed free of unbound receptor. The 
. amount of marker present at predefined regions of the 
surface is then measured and can be related to the amount of 
receptor in solution. Methods and conditions for performing 
55 such assays arc well-known and arc . presented in. for 
example. L. Hood ci al.. Immunology, Bcnjamin/Cummings 
(1978). and E. Harlow el al.. Antibodies. A laboratory 
Manual, Cold Spring Harbor Laboratory, (1988). Sec, also 
VJS. Pal. No. 4 J76,l 10 for methods of performing sandwich 
60 assays. The precise conditions for performing these steps 
will be apparent to one skilled in the an. 

A competitive assay method for two receptors can also be 
employed using the present invention. Methods of conduct- 
ing competitive assays arc known to those of skill in the art. 
65 One such method involves immobilizing eonformationally 
restricted probes on predefined regions of a surface as 
described above. An unmarked first receptor is then bound 



the three containers (i. c. A is added to container 1, B is 
added to container 2 and C is added to container 3). 

Thereafter, the various beads arc appropriately washed of 
excess reagents, and remixed in one container. Again, it will 
be recognized that by virtue of the targe number of beads 
utilized at the outset, there will similarly be a large number 
of beads randomly dispersed in the container, each having a 
particular first portion of the monomer to be synthesized on 
a surface thereof. 

Thereafter, the various beads are again divided for coupling
in another group of three containers. The beads in the
first container are deprotected and exposed to a second
monomer (D), while the beads in the second and third
containers are coupled to molecule portions E and F,
respectively. Accordingly, molecules AD, BD, and CD will be
present in the first container, while AE, BE, and CE will be
present in the second container, and molecules AF, BF, and
CF will be present in the third container. Each bead, however,
will have only a single type of molecule on its surface.
Thus, all of the possible molecules formed from the first
portions A, B, C and the second portions D, E, and F have
been formed.
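
The split, couple and remix cycle described above can be sketched in a few lines of Python. This is only an illustration and not part of the disclosed method: the function name `split_and_mix` and the representation of each bead population as a string of monomer letters are assumptions made for this example.

```python
from itertools import product


def split_and_mix(rounds):
    """Simulate split-and-mix synthesis on beads.

    `rounds` is a list of monomer lists, one list per coupling round.
    Returns the pooled bead populations and the contents of each
    container in the final round (a hypothetical representation).
    """
    pooled = [""]  # one bead population with nothing coupled yet
    containers = {}
    for monomers in rounds:
        # Split the pool and couple a different monomer in each container.
        containers = {m: [bead + m for bead in pooled] for m in monomers}
        # Wash and remix everything into a single container.
        pooled = [bead for m in monomers for bead in containers[m]]
    return pooled, containers


pooled, last = split_and_mix([["A", "B", "C"], ["D", "E", "F"]])
print(last["D"])  # contents of the first container after the second round
print(sorted(pooled) == sorted(a + b for a, b in product("ABC", "DEF")))
```

Each entry stands for the single type of molecule carried by a bead, so two rounds of three monomers each yield the nine molecules listed in the text.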

The beads are then recombined into one container and
additional such steps are conducted to complete the
synthesis of the polymer molecules. In a preferred embodiment,
the beads are tagged with an identifying tag which is
unique to the particular double-stranded oligonucleotide or

to the probes on the surface having a known specific binding
affinity for the receptors. A solution containing a marked
second receptor is then introduced to the surface and incubated
for a suitable time. The surface is then washed free of
unbound reagents and the amount of marker remaining on
the surface is measured. In another form of competition
assay, marked and unmarked receptors can be exposed to the
surface simultaneously. The amount of marker remaining on
predefined regions of the surface can be related to the
amount of unknown receptor in solution. Yet another form of
competition assay will utilize two receptors having different
labels, for example, two different chromophores.

In other embodiments, in order to detect receptor binding,
the double-stranded oligonucleotides which are formed with
attached probes or with a flexible linking group will be
treated with an intercalating dye, preferably a fluorescent
dye. The library can be scanned to establish a background
fluorescence. After exposure of the library to a receptor
solution, the exposed library will be scanned or illuminated
and examined for those areas in which fluorescence has
changed. Alternatively, the receptor of interest can be
labeled with a fluorescent dye by methods known to those of
skill in the art and incubated with the library of probes. The
library can then be scanned or illuminated, as above, and
examined for areas of fluorescence.

In instances where the libraries are synthesized on beads
in a number of containers, the beads are exposed to a
receptor of interest. In a preferred embodiment the receptor
is fluorescently or radioactively labelled. Thereafter, one or
more beads are identified that exhibit significant levels of,
for example, fluorescence using one of a variety of techniques.
For example, in one embodiment, mechanical separation
under a microscope is utilized. The identity of the
molecule on the surface of such separated beads is then
identified using, for example, NMR, mass spectrometry,
PCR amplification and sequencing of the associated DNA,
or the like. In another embodiment, automated sorting (i.e.,
fluorescence activated cell sorting) can be used to separate
beads (bearing probes) which bind to receptors from those
which do not bind. Typically the beads will be labeled and
identified by methods disclosed in Needels, et al., Proc.
Natl. Acad. Sci. USA 90:10700-10704 (1993), incorporated
herein by reference.

The assay methods described above for the libraries of the
present invention will have tremendous application in such
endeavors as DNA "footprinting" of proteins which bind
DNA. Currently, DNA footprinting is conducted using
DNase I digestion of double-stranded DNA in the presence
of a putative DNA binding protein. Gel analysis of cut and
protected DNA fragments then provides a "footprint" of
where the protein contacts the DNA. This method is both
labor and time intensive. See Galas et al., Nucleic Acids Res.
5:3157 (1978). Using the above methods, a "footprint" could
be produced using a single array of unimolecular, double-stranded
oligonucleotides in a fraction of the time of conventional
methods. Typically, the protein will be labeled
with a radioactive or fluorescent species and incubated with
a library of unimolecular, double-stranded DNA. Phosphorimaging
or fluorescence detection will provide a footprint
of those regions on the library where the protein has bound.
Alternatively, unlabeled protein can be used. When unlabeled
protein is used, the double-stranded oligonucleotides
in the library will all be labeled with a marker, typically a
fluorescent marker. Incorporation of a marker into each
member of the library can be carried out by terminating the
oligonucleotide synthesis with a commercially available
fluorescing phosphoramidite nucleotide derivative. Following
incubation with the unlabeled protein, the library will be
treated with DNase I and examined for areas which are
protected from cleavage.

The assay methods described above for the libraries of the
present invention can also be used in reverse drug discovery.
In such an application, a compound having known pharmacological
safety or other desired properties (e.g., aspirin)
could be screened against a variety of double-stranded
oligonucleotides for potential binding. If the compound is
shown to bind to a sequence associated with, for example,
tumor suppression, the compound can be further examined
for efficacy in the related diseases.

In other embodiments, probe arrays comprising β-turn
mimetics can be prepared and assayed for activity against a
particular receptor. β-turn mimetics are compounds having
molecular structures similar to β-turns, which are one of the
three major components in protein molecular architecture.
β-turns are similar in concept to hairpin turns of oligonucleotide
strands, and are often critical recognition features for
various protein-ligand and protein-protein interactions. As a
result, a library of β-turn mimetic probes can provide or
suggest new therapeutic agents having a particular affinity
for a receptor which will correspond to the affinity exhibited
by the β-turn and its receptor.
Bioelectronic Devices and Methods

In another aspect, the present invention provides a method
for the bioelectronic detection of sequence-specific oligonucleotide
hybridization. A general method and device
which is useful in diagnostics in which a biochemical
species is attached to the surface of a sensor is described in
U.S. Pat. No. 4,562,157 (the Lowe patent), incorporated
herein by reference. The present method utilizes arrays of
immobilized oligonucleotides (prepared, for example, using
VLSIPS™ technology) and the known photo-induced electron
transfer which is mediated by a DNA double helix
structure. See Murphy et al., Science 262:1025-1029
(1993). This method is useful in hybridization-based
diagnostics, as a replacement for fluorescence-based detection
systems. The method of bioelectronic detection also offers
higher resolution and potentially higher sensitivity than
earlier diagnostic methods involving sequencing/detecting
by hybridization. As a result, this method finds applications
in genetic mutation screening and primary sequencing of
oligonucleotides. The method can also be used for Sequencing
By Hybridization (SBH), which is described in co-pending
application Ser. Nos. 08/082,937 (filed Jun. 25,
1993, now abandoned) and 08/168,904 (filed Dec. 15, 1993),
each of which is incorporated herein by reference for all
purposes. This method uses a set of short oligonucleotide
probes of defined sequence to search for complementary
sequences on a longer target strand of DNA. The hybridization
pattern is used to reconstruct the target DNA
sequence. Thus, the hybridization analysis of large numbers
of probes can be used to sequence long stretches of DNA. In
immediate applications of this hybridization methodology, a
small number of probes can be used to interrogate local
DNA sequence.
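
The idea of reconstructing a target sequence from its hybridization pattern can be illustrated with a short Python sketch. It is not the procedure of the incorporated applications: it assumes error-free hybridization data, probes of a single length k, and a target in which every overlap of k-1 bases occurs only once. The function names `spectrum` and `reconstruct` are hypothetical.

```python
def spectrum(target, k):
    """Return the set of all length-k probes that match the target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def reconstruct(probes):
    """Rebuild a target from its probe set by chaining (k-1)-base overlaps.

    Assumes ideal data and no repeated (k-1)-base overlap in the target;
    raises ValueError when the chain is ambiguous or broken.
    """
    remaining = set(probes)
    k = len(next(iter(remaining)))
    suffixes = {p[1:] for p in remaining}
    # The first probe is the only one whose prefix is no other probe's suffix.
    starts = [p for p in remaining if p[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting probe")
    sequence = starts[0]
    remaining.remove(sequence)
    while remaining:
        tail = sequence[-(k - 1):]
        candidates = [p for p in remaining if p[:-1] == tail]
        if len(candidates) != 1:
            raise ValueError("ambiguous or missing overlap")
        sequence += candidates[0][-1]  # extend by one base
        remaining.remove(candidates[0])
    return sequence


target = "ATGCGTAC"
probes = spectrum(target, 3)
print(reconstruct(probes) == target)
```

Real targets contain repeats and real hybridization data contain errors, so practical reconstruction needs more than this greedy chaining; the sketch only shows why a set of short overlapping probes can determine a longer sequence.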

In the present inventive method, hybridization is monitored
using bioelectronic detection. In this method, the target
DNA, or first oligonucleotide, is provided with an electron-donor
tag and then incubated with an array of oligonucleotide
probes, each of which bears an electron-acceptor tag
and occupies a known position on the surface of the array.
After hybridization of the first oligonucleotide to the array
has occurred, the hybridized array is illuminated to induce
an electron transfer reaction in the direction of the surface of
the array. The electron transfer reaction is then detected at
the location on the surface where hybridization has taken
place. Typically, each of the oligonucleotide probes in an
array will have an attached electron-acceptor tag located
near the surface of the solid support used in preparation of
the array. In embodiments in which the arrays are prepared
by light-directed methods (i.e., typically 3' to 5' direction),
the electron-acceptor tag will be located near the 3' position.
The electron-acceptor tag can be attached either to the 3'
monomer by methods known to those of skill in the art, or
it can be attached to a spacing group between the 3'
monomer and the solid support. Such a spacing group will
have, in addition to functional groups for attachment to the
solid support and the oligonucleotide, a third functional
group for attachment of the electron-acceptor tag. The target
oligonucleotide will typically have the electron-donor tag
attached at the 3' position. Alternatively, the target oligonucleotide
can be incubated with the array in the absence of
an electron-donor tag. Following incubation, the electron-donor
tag can be added in solution. The electron-donor tag
will then intercalate into those regions where hybridization
has occurred. An electron transfer reaction can then be
detected in those regions having a continuous DNA double
helix.

The electron-donor tag can be any of a variety of complexes
which participate in electron transfer reactions and
which can be attached to an oligonucleotide by a means
which does not interfere with the electron transfer reaction.
In preferred embodiments, the electron-donor tag is a
ruthenium (II) complex, more preferably a ruthenium (II)
(phen')2(dppz) complex.

The electron-acceptor tag can be any species which, with
the electron-donor tag, will participate in an electron transfer
reaction. An example of an electron-acceptor tag is a
rhodium (III) complex. A preferred electron-acceptor tag is
a rhodium (III) (phi)2(phen') complex.

In a particularly preferred embodiment, the electron-donor
tag is a ruthenium (II) (phen')2(dppz) complex and the
electron-acceptor tag is a rhodium (III) (phi)2(phen')
complex.

In still another aspect, the present invention provides a
device for the bioelectronic detection of sequence-specific
oligonucleotide hybridization. The device will typically consist
of a sensor having a surface to which an array of
oligonucleotides is attached. The oligonucleotides will be
attached in pre-defined areas on the surface of the sensor and
have an electron-acceptor tag attached to each oligonucleotide.
The electron-acceptor tag will be a tag which is
capable of producing an electron transfer signal upon
illumination of a hybridized species, when the complementary
oligonucleotide bears an electron-donating tag. The signal
will be in the direction of the sensor surface and be detected
by the sensor.

In a preferred embodiment, the sensor surface will be a
silicon-based surface which can sense the electronic signal
induced and, if necessary, amplify the signal. The metal
contacts on which the probes will be synthesized can be
treated with an oxygen plasma prior to synthesis of the
probes to enhance the silane adhesion and concentration on
the surface. The surface will further comprise a multi-gated
field effect transistor, with each gate serving as a sensor and
different oligonucleotides attached to each gate. The
oligonucleotides will typically be attached to the metal contacts
on the sensor surface by means of a spacer group.

The spacer group should not be too long, in order to
ensure that the sensing function of the device is easily
activated by the binding interaction and subsequent illumination
of the "tagged" hybridized oligonucleotides. Preferably,
the spacer group is from 3 to 12 atoms in length and
will be as described above for the surface modifying portion
of the spacer group, L¹.

The oligonucleotides which are attached to the spacer
group can be formed by any of the solid phase techniques
which are known to those of skill in the art. Preferably, the
oligonucleotides are formed one base at a time in the
direction of the 3' terminus to the 5' terminus by the
"light-directed" methods described above. The oligonucleotide
can then be modified at the 3' end to attach the
electron-acceptor tag. A number of suitable methods of
attachment are known. For example, modification with the
reagent Aminolink2 (from Applied Biosystems, Inc.) provides
a terminal phosphate moiety which is derivatized with
an aminohexyl phosphate ester. Coupling of a carboxylic
acid, which is present on the electron-acceptor tag, to the
amine can then be carried out using HOBT and DCC.
Alternatively, synthesis of the oligonucleotide can begin
with a suitably derivatized and protected monomer which
can then be deprotected and coupled to the electron-acceptor
tag once the complete oligonucleotide has been synthesized.

The silica surface can also be replaced by silicon nitride
or oxynitride, or by an oxide of another metal, especially
aluminum, titanium (IV) or iron (III). The surface can also
be any other film, membrane, insulator or semiconductor
overlying the sensor which will not interfere with the
detection of electron transfer and to which an
oligonucleotide can be coupled.

Additionally, detection devices other than an FET can be
used. For example, sensors such as bipolar transistors, MOS
transistors and the like are also useful for the detection of
electron transfer signals.
Adhesives

In still another aspect, the present invention provides an
adhesive comprising a pair of surfaces, each having a
plurality of attached oligonucleotides, wherein the single-stranded
oligonucleotides on one surface are complementary
to the single-stranded oligonucleotides on the other surface.
The strength and position/orientation specificity can be
controlled using a number of factors including the number
and length of oligonucleotides on each surface, the degree of
complementarity, and the spatial arrangement of complementary
oligonucleotides on the surface. For example, increasing
the number and length of the oligonucleotides on each
surface will provide a stronger adhesive. Suitable lengths of
oligonucleotides are typically from about 10 to about 70
nucleotides. Additionally, the surfaces of oligonucleotides
can be prepared such that adhesion occurs in an extremely
position-specific manner by a suitable arrangement of
complementary oligonucleotides in a specific pattern. Small
deviations from the optimum spatial arrangement are
energetically unfavorable as many hybridization bonds must be
broken and are not reformed in any other relative orientation.

The adhesives of the present invention will find use in
numerous applications. Generally, the adhesives are useful
for adhering two surfaces to one another. More specifically,
the adhesives will find application where biological
compatibility of the adhesive is desired. An example of a
biological application involves use in surgical procedures
where tissues must be held in fixed positions during or
following the procedure. In this application, the surfaces of
the adhesive will typically be membranes which are compatible
with the tissues to which they are attached.

A particular advantage of the adhesives of the present
invention is that when they are formed in an orientation
specific manner, the adhesive portions will be "self-finding,"
that is, the system will go to the thermodynamic equilibrium
in which the two sides are matched in the predetermined,
orientation specific manner.

EXAMPLES 

Example 1 

This example illustrates the general synthesis of an array
of unimolecular, double-stranded oligonucleotides on a solid
support.

Unimolecular double-stranded DNA molecules were synthesized
on a solid support using standard light-directed
methods (VLSIPS™ protocols). Two hexaethylene glycol
(PEG) linkers were used to covalently attach the synthesized
oligonucleotides to the derivatized glass surface. Synthesis
of the first (inner) strand proceeded one nucleotide at a time
using repeated cycles of photo-deprotection and chemical
coupling of protected nucleotides. The nucleotides each had
a protecting group on the base portion of the monomer as
well as a photolabile MeNPoc protecting group on the 5'
hydroxyl. Upon completion of the inner strand, another
MeNPoc-protected PEG linker was covalently attached to
the 5' end of the surface-bound oligonucleotide. After addition
of the internal PEG linker, the PEG was photodeprotected,
and the synthesis of the second strand proceeded in the
normal fashion. Following the synthesis cycles, the DNA
bases were deprotected using standard protocols. The
sequence of the second (outer) strand, being complementary
to that of the inner strand, provided molecules with short,
hydrogen bonded, unimolecular double-stranded structure
as a result of the presence of the internal flexible PEG linker.

An array of 16 different molecules was synthesized on a
derivatized glass slide in order to determine whether short,
unimolecular DNA structures could be formed on a surface
and whether they could adopt structures that are recognized
by proteins. Each of the 16 different molecular species
occupies a different physical region on the glass surface so
that there is a one-to-one correspondence between molecular
identity and physical location. The molecules are of the form

S-P-P-C-C-A/T-A/T-A/T-A/T-G-C-P-G-C-A/T-A/T-A/T-A/T-G-G-F

where S is the solid surface having silyl groups, P is a PEG
linker, A, C, G, and T are the DNA nucleotides, and F is a
fluorescent tag. The DNA sequence is listed from the 3' to
the 5' end (the 3' end of the DNA molecule is attached to the
solid surface via a silyl group and 2 PEG linkers). The
sixteen molecules synthesized on the solid support differed
in the various permutations of A and T in the above formula.

Example 2

This example illustrates the ability of a library of surface-bound,
unimolecular, double-stranded oligonucleotides to
exist in duplex form and to be recognized and bound by a
protein.

A library of 16 different members was prepared as
described in Example 1. The 16 molecules all have the same
composition (same number of As, Cs, Gs and Ts), but the
order is different. Four of the molecules have an outer strand
that is 100% complementary to the inner strand (these
molecules will be referred to as DS, double-stranded, below).
One of the four DS oligonucleotides has a sequence that is
recognized by the restriction enzyme EcoRI. If the molecule
can loop back and form a DNA duplex, it should be
recognized and cut by the restriction enzyme, thereby releasing
the fluorescent tag. Thus, the action of the enzyme
provided a functional test for DNA structure, and also served
to demonstrate that these structures can be recognized at the
surface by proteins. The remaining 12 molecules had outer
strands that were not complementary to their inner strands
(referred to as SS, single-stranded, below). Of these, three
had an outer strand and three had an inner strand whose
sequence was an EcoRI half-site (the sequence on one
strand was correct for the enzyme, but the other half was
not). The solid support with an array of molecules on the
surface is referred to as a "chip" for the purposes of the
following discussion. The presence of fluorescently labelled
molecules on the chip was detected using confocal fluorescence
microscopy. The action of various enzymes was
determined by monitoring the change in the amount of
fluorescence from the molecules on the chip surface (e.g.,
"reading" the chip) upon treatment with enzymes that can
cut the DNA and release the fluorescent tag at the 5' end.

The three different enzymes used to characterize the
structure of the molecules on the chip were:

1) Mung Bean Nuclease: sequence independent, single-strand
specific DNA endonuclease;

2) DNase I: sequence independent, double-strand specific
endonuclease;

3) EcoRI: restriction endonuclease that recognizes the
sequence (5'-3') GAATTC in double stranded DNA, and cuts
between the G and the first A.

Mung Bean Nuclease and EcoRI were
obtained from New England Biolabs, and DNase I was
obtained from Boehringer Mannheim. All enzymes were
used at a concentration of 200 units per mL in the buffer
recommended by the manufacturer. The enzymatic reactions
were performed in a 1 mL flow cell at 22° C. and were
typically allowed to proceed for 90 minutes.
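
The distinction between DS and SS molecules, and between a full EcoRI site and a half-site, can be made concrete with a small Python sketch. It is an illustration only. The formula in Example 1 lists each strand from 3' to 5', whereas the strings below are written 5' to 3'; the particular A/T choices are examples picked to spell the EcoRI site and are not taken from the source, and the names `reverse_complement` and `classify` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
ECORI_SITE = "GAATTC"  # read 5'->3'; the site is its own reverse complement


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify(inner, outer):
    """Classify one molecule from its inner and outer strands (both 5'->3')."""
    # The outer strand can fold back onto the inner strand only if it is
    # the exact reverse complement of the inner strand.
    duplex = outer == reverse_complement(inner)
    return {
        "label": "DS" if duplex else "SS",
        "ecori_site_in_duplex": duplex and ECORI_SITE in outer,
        "half_site_outer": (not duplex) and ECORI_SITE in outer,
        "half_site_inner": (not duplex) and ECORI_SITE in inner,
    }


# Illustrative molecules: a fully complementary pair carrying the EcoRI
# site, and a pair whose outer strand alone carries the site.
print(classify("CGAATTCC", "GGAATTCG"))
print(classify("CGATATCC", "GGAATTCG"))
```

Under these assumptions the first molecule corresponds to the DS region with the EcoRI sequence, and the second to an SS region with the half-site on the outer strand.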

Upon treatment of the chip with the enzyme EcoRI, the
fluorescence signal in the DS EcoRI region and the 3 SS
regions with the EcoRI half-site on the outer strand was
reduced by about 10% of its initial value. This reduction was 
at least 5 times greater than for the other regions of the chip, 
indicating that the action of the enzyme is sequence specific 
on the chip. It was not possible to determine if the factor is 
greater than 5 in these preliminary experiments because of 
uncertainty in the constancy of the fluorescence background. 
However, because the purpose of these early experiments 
was to determine whether unimolecular double-stranded 
structures could be formed and whether they could be 
specifically recognized by proteins (and not to provide a 
quantitative measure of enzyme specificity), qualitative dif- 
ferences between the different synthesis regions were suf- 
ficient. 

The reduction in signal in the 3 SS regions with the EcoRI
half-site on the outer strand indicated either that the enzyme
cuts single-stranded DNA with a particular sequence, or that
these molecules formed a double-stranded structure that was
recognized by the enzyme. The molecules on the chip
surface were at a relatively high density, with an average
spacing of approximately 100 angstroms. Thus, it was
possible for the outer strand of one molecule to form a
double-stranded structure with the outer strand of a neighboring
molecule. In the case of the 3 SS regions with the
EcoRI half-site on the outer strand, such a bimolecular
double-stranded region would have the correct sequence and
structure to be recognized by EcoRI. However, it would
differ from the unimolecular double-stranded molecules in
that the inner strand remains single-stranded and thus amenable
to cleavage by a single-strand specific endonuclease
such as Mung Bean Nuclease. Therefore, it was possible to
distinguish unimolecular from bimolecular double-stranded
DNA molecules on the surface by their ability to be cut by
single- and double-strand specific endonucleases.

In order to remove all molecules that have single-stranded
structures and to identify unimolecular double-stranded
molecules, the chip was first exhaustively treated with Mung
Bean Nuclease. The reduction in the fluorescence signal was
greater by about a factor of 2 for the SS regions of the chip,
including those with the EcoRI half-site on the outer strand
that were cleaved by EcoRI, than for the 4 DS regions.
Following Mung Bean Nuclease treatment, the chip was
treated with either DNase I (which cuts all remaining
double-stranded molecules) or EcoRI (which should cut
only the remaining double-stranded molecules with the
correct sequence). Upon treatment with DNase I, the fluorescence
signal in the 4 DS regions was reduced by at least
5-fold more than the signal in the SS regions. Upon EcoRI
treatment, the signal in the single DS region with the correct
EcoRI sequence was reduced by at least a factor of 3 more
than the signal in any other region on the chip. Taken
together, these results indicated that the surface-bound molecules
synthesized with two complementary strands separated
by a flexible PEG linker form intramolecular double-stranded
structures that were resistant to a single-strand
specific endonuclease and were recognized by both a
double-strand specific endonuclease and a sequence-specific
restriction enzyme.

What is claimed is: 

1. A synthetic unimolecular, double-stranded oligonucleotide
library comprising a plurality of different members,
each member having the formula:

Y-L¹-X¹-L²-X²

wherein,
Y is a solid support;
X¹ and X² are a pair of complementary oligonucleotides;
L¹ is a spacer;
L² is a linking group having sufficient length such that X¹
and X² form a double-stranded oligonucleotide.

2. A library in accordance with claim 1, wherein L² is a
polyethylene glycol group.

3. A library in accordance with claim 1, wherein X¹ and
X² are complementary oligonucleotides each comprising of
from 6 to 30 nucleic acid monomers.

4. A library in accordance with claim 1, wherein said solid
support is a silica support and L¹ comprises an aminoalkylsilane
and from 1 to 4 hexaethyleneglycols.

5. A library in accordance with claim 1, wherein said solid
support is a silica support, L¹ comprises an aminoalkylsilane
and from 1 to 4 hexaethyleneglycols, L² is a polyethyleneglycol
group and X¹ and X² are complementary oligonucleotides
each comprising of from 6 to 30 nucleic acid monomers.

6. A synthetic unimolecular, double-stranded oligonucleotide
library of claim 1, wherein a portion of said double-stranded
oligonucleotides formed by X¹ and X² further
comprise a loop. 
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a do avc nr vfiTVPTAT <; ATTACHED TO A flow of fluids from the reactor system, selectively activating 

ARRAYS 0F ^SSSxn the translation stage, and selectively ffluminating the sub- 

auw straie so as to fonn a plurality of diverse polymer sequences 

CROSS REFERENCE TO RELATED on the substrate at predetermined locations. 

APPLICATIONS 3 The invention also provides a technique for selection of 

r . linVrr molecules in a very large scale immobilized polymer 
This application is a division of U-S. patent application 5ynthrds (VLSIPS™) method. According to this aspect of 
Set. No. 08690272. filed Feb. 16. 1995, now US. Pat. No, &c invention, the invention provides a method of screening 
5,4S9.678.which is a continuation of US. a of ft n Vrr polymers for use in binding affinity 

Ser, No. 07/624,120, filed Dec 6. 1990. now abandoned. w ^ -^rioo me ^ 0 f forming a 

which is a continuation-in-part of US. patent appbeanon lmlitY ^ Vtn \r„ polymers on a substrate in selected 
Ser. No. 07/492.462. filed Mar. 7. 1990. now U 5. Tat. No. the linker polymers farmed by the steps of recur- 

5.143.854, which is a continnation-in-part of US. patent oq & $iirfacc of & 5nbstrate> irradiating a portion of the 

application Ser. No. 07/36X901, filed Jon. 7^1939, now sdtaeAtcpons to remove a protective group, and contact- 
abandoned, and hereby incorporated herein by reference tor u ^ surface with a monomer, contacting the plurality of 
all purposes. This application is also a contmuafroo-uvpart polymers with a Hgand; and contacting the ligand with 

of US. patent application Ser No. 08/456,887. filed I Jim. 1, & Ubc Q J rcccptor . 

1995, which is a division of US. patent *PpHcatioa Set Na Acccrdillg to mother aspect of the inventioa, improved 
07/954.646. filed Sep. 30, 1992, now US. Pat. Na 5*43, ^ otorcmovablc protective groups are provided. According 
934. which is a division of US. patent application | Sex. No. ^ £ ^ r- Ac t compound having the 

07/850356. filed Mar. 12. 1992, now US. PaL No 5.405, f ftrTmiU . *~ 
783. which is a division of VS. patent application Ser. No, 
07/492,462, filed Mar. 7. 1990, now US. Par. Set No. 
5 143 854, which is a continnation-in-part of US, patent 
application Set. No. 07/362,901 filed Jan. 7, 1989, now M 
abandoned. 

This application is also related to U-S. patent application R r^^^ o&fc 

Set No. 08/670.1 18 filed Jun. 25, 1996, which is a division T 
of US. patent application Set No. 08/168,104, filed Dec 15, 0Me 
1993 which is a continuation of U.S. patent application Set x 

No. 07/624 114 filed Dec 6, 1990, now abandoned, and wherein n=0 or 1; Yis selected from the group consisting ot 
US patent^pplication Set No. 07/626.730, filed Dec 6. „ oxygen of the carboxyl group of a natural or unnatural 
1990 now U S Pat No. 5.547.839. and also incorporated xm j iao ^ an amino group of a natural or unnatural amino 
herein by reference for all purposes. acid, or the C-5' oxygen group of a natural or unnatural 

33 deODcyribonudeic or ribonndek acid; R and R lndepen- 
COFYRIGHT NOTICE dently arc a hydrogen atom, a lower allryU aryL benzyL 

S'£5Ft2?hS X^IK - ^ «yL or alltenyl group 

reproduction by anyone of the patent docunxnt or the patent «o is mxddng tech- 

disclosure as it appears in the Patent and Trademark Office
patent file or records, but otherwise reserves all copyright
rights whatsoever.

BACKGROUND OF THE INVENTION

The present invention relates to the field of polymer
synthesis. More specifically, the invention provides a reactor
system, a masking strategy, photoremovable protective
groups, data collection and processing techniques, and
applications for light directed synthesis of diverse polymer
sequences on substrates.

SUMMARY OF THE INVENTION

Methods, apparatus, and compositions for synthesis and
use of diverse polymer sequences on a substrate are
disclosed, as well as applications thereof.

According to one aspect of the invention, an improved
reactor system for synthesis of diverse polymer sequences
on a substrate is provided. According to this embodiment the
invention provides a reactor for contacting reaction fluids
[...] through a mask at selected times; and an appropriately
programmed digital computer for selectively directing a [...]

[...] the invention provides an ordered method for forming a
plurality of polymer sequences by sequential addition of
reagents comprising the step of serially protecting and
deprotecting portions of the plurality of polymer sequences
for addition of other portions of the polymer sequences
using a binary synthesis strategy.

Improved data collection equipment and techniques are
also provided. According to one embodiment, the
instrumentation provides a system for determining affinity of
a receptor to a ligand comprising: means for applying light to
a surface of a substrate, the substrate comprising a plurality
of ligands at predetermined locations, the means for
providing simultaneous illumination at a plurality of the
predetermined locations; and an array [...] fluoresced at the
plurality of predetermined locations. The invention also
provides [...] techniques including the steps of [...]
receptors to a [...] of ligands in regions at known locations,
[...] relative binding affinity of the receptor to remaining
collection points.

Protected amino acid N-carboxy anhydrides for use in
polymer synthesis are also disclosed. According to this
aspect, the invention provides a compound having the
formula:



[structural formula not reproduced]

where R is a side chain of a natural or unnatural amino acid
and X is a photoremovable protecting group.

A further understanding of the nature and advantages of
the inventions herein may be realized by reference to the
remaining portions of the specification and the attached
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 schematically illustrates light-directed spatially- 
addressable parallel chemical synthesis; 

FIG. 2 schematically illustrates one example of light- 
directed peptide synthesis; 

FIG. 3 is a three-dimensional representation of a portion 
of the checkerboard array of YGGFL and PGGFL;

FIG. 4 schematically illustrates an automated system for 
synthesizing diverse polymer sequences; 

FIGS. 5a and 5b illustrate operation of a program for
polymer synthesis;

FIGS. 6a and 6b are a schematic illustration of a "pure"
binary masking strategy;

FIGS. 7a and 7b are a schematic illustration of a gray code
binary masking strategy;

FIGS. 8a and 8b are a schematic illustration of a modified
gray code binary masking strategy;

FIG. 9a schematically illustrates a masking scheme for a
four step synthesis;

FIG. 9b schematically illustrates synthesis of all 400
peptide dimers;

FIG. 10 is a coordinate map for the ten-step binary
synthesis;

FIG. 11 schematically illustrates a data collection system;

FIG. 12 is a block diagram illustrating the architecture of
the data collection system;

FIG. 13 is a flow chart illustrating operation of software
for the data collection/analysis system; and

FIG. 14 illustrates a three-dimensional plot of intensity
versus position for light directed synthesis of a dinucleotide.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS

CONTENTS

I. Definitions

II. General

A. Deprotection and Addition

1. Example

2. Example

B. Antibody Recognition

1. Example

III. Synthesis

A. Reactor System

B. Binary Synthesis Strategy

1. Example

2. Example

3. Example

4. Example

5. Example

6. Example

C. Linker Selection

D. Protecting Groups

1. Use of Photoremovable Groups During Solid-Phase
Synthesis of Peptides

2. Use of Photoremovable Groups During Solid-Phase
Synthesis of Oligonucleotides

E. Amino Acid N-Carboxy Anhydrides Protected with a
Photoremovable Group

IV. Data Collection

A. Data Collection System

B. Data Analysis

V. Other Representative Applications

A. Oligonucleotide Synthesis

1. Example

VI. Conclusion

I. DEFINITIONS

Certain terms used herein are intended to have the
following general definitions:

1. Complementary:

Refers to the topological compatibility or matching
together of interacting surfaces of a ligand molecule and its
receptor. Thus, the receptor and its ligand can be described
as complementary, and furthermore, the contact surface
characteristics are complementary to each other.

2. Epitope:

The portion of an antigen molecule which is delineated by
the area of interaction with the subclass of receptors known
as antibodies.

3. Ligand:

A ligand is a molecule that is recognized by a particular
receptor. Examples of ligands that can be investigated by
this invention include, but are not restricted to, agonists and
antagonists for cell membrane receptors, toxins and venoms,
viral epitopes, hormones, hormone receptors, peptides,
enzymes, enzyme substrates, cofactors, drugs (e.g., opiates,
steroids, etc.), lectins, sugars, oligonucleotides, nucleic
acids, oligosaccharides, proteins, and monoclonal
antibodies.

4. Monomer:

A member of the set of small molecules which can be
joined together to form a polymer. The set of monomers
includes but is not restricted to, for example, the set of
common L-amino acids, the set of D-amino acids, the set of
synthetic amino acids, the set of nucleotides and the set of
pentoses and hexoses. As used herein, monomers refers to
any member of a basis set for synthesis of a polymer. For
example, dimers of the 20 naturally occurring L-amino acids
form a basis set of 400 monomers for synthesis of
polypeptides. Different basis sets of monomers may be used at
successive steps in the synthesis of a polymer. Furthermore,
each of the sets may include protected members which are
modified after synthesis.

5. Peptide:

A polymer in which the monomers are alpha amino acids
and which are joined together through amide bonds and
alternatively referred to as a polypeptide. In the context of
this specification it should be appreciated that the amino
acids may be the L-optical isomer or the D-optical isomer.
Peptides are often two or more amino acid monomers long,
and often more than 20 amino acid monomers long.
Standard abbreviations for amino acids are used (e.g., P for
proline). These abbreviations are included in Stryer,
Biochemistry, Third Ed., 1988, which is incorporated herein
by reference for all purposes.

6. Radiation: 

Energy which may be selectively applied including
energy having a wavelength of between 10⁻¹⁴ and 10⁴
meters including, for example, electron beam radiation,
gamma radiation, x-ray radiation, ultraviolet radiation,
visible light, infrared radiation, microwave radiation, and radio
waves. "Irradiation" refers to the application of radiation to
a surface.

7. Receptor:

A molecule that has an affinity for a given ligand.
Receptors may be naturally-occurring or manmade molecules.
Also, they can be employed in their unaltered state or as
aggregates with other species. Receptors may be attached,
covalently or noncovalently, to a binding member, either
directly or via a specific binding substance. Examples of
receptors which can be employed by this invention include,
but are not restricted to, antibodies, cell membrane
receptors, monoclonal antibodies and antisera reactive with
specific antigenic determinants (such as on viruses, cells or
other materials), drugs, polynucleotides, nucleic acids,
peptides, cofactors, lectins, sugars, polysaccharides, cells,
cellular membranes, and organelles. Receptors are
sometimes referred to in the art as anti-ligands. As the term
receptors is used herein, no difference in meaning is
intended. A "Ligand Receptor Pair" is formed when two
macromolecules have combined through molecular
recognition to form a complex. Other examples of receptors which
can be investigated by this invention include but are not
restricted to:

a) Microorganism receptors:
Determination of ligands which bind to receptors, such
as specific transport proteins or enzymes essential to
survival of microorganisms, is useful in developing
a new class of antibiotics. Of particular value would
be antibiotics against opportunistic fungi, protozoa,
and those bacteria resistant to the antibiotics in
current use.

b) Enzymes:
For instance, one type of receptor is the binding site of
enzymes such as the enzymes responsible for
cleaving neurotransmitters; determination of ligands
which bind to certain receptors to modulate the
action of the enzymes which cleave the different
neurotransmitters is useful in the development of
drugs which can be used in the treatment of disorders
of neurotransmission.

c) Antibodies:
For instance, the invention may be useful in
investigating the ligand-binding site on the antibody
molecule which combines with the epitope of an antigen
of interest; determining a sequence that mimics an
antigenic epitope may lead to the development of
vaccines of which the immunogen is based on one or
more of such sequences or lead to the development
of related diagnostic agents or compounds useful in
therapeutic treatments such as for auto-immune
diseases (e.g., by blocking the binding of the "self"
antibodies).

d) Nucleic Acids:
Sequences of nucleic acids may be synthesized to
establish DNA or RNA binding sequences.

e) Catalytic Polypeptides:
Polymers, preferably polypeptides, which are capable
of promoting a chemical reaction involving the
conversion of one or more reactants to one or more
products. Such polypeptides generally include a
binding site specific for at least one reactant or
reaction intermediate and an active functionality
proximate to the binding site, which functionality is
capable of chemically modifying the bound reactant.
Catalytic polypeptides are described in, for example,
U.S. Pat. No. 5,215,899, which is incorporated
herein by reference for all purposes.

f) Hormone receptors:
Examples of hormone receptors include, e.g., the
receptors for insulin and growth hormone.
Determination of the ligands which bind with high affinity to
a receptor is useful in the development of, for
example, an oral replacement of the daily injections
which diabetics must take to relieve the symptoms of
diabetes, and in the other case, a replacement for the
scarce human growth hormone which can only be
obtained from cadavers or by recombinant DNA
technology. Other examples are the vasoconstrictive
hormone receptors; determination of those ligands
which bind to a receptor may lead to the
development of drugs to control blood pressure.

g) Opiate receptors:
Determination of ligands which bind to the opiate
receptors in the brain is useful in the development of
less-addictive replacements for morphine and related
drugs.

8. Substrate:

A material having a rigid or semi-rigid surface. In many
embodiments, at least one surface of the substrate will be
substantially flat, although in some embodiments it may be
desirable to physically separate synthesis regions for
different polymers with, for example, wells, raised regions, etched
trenches, or the like. According to other embodiments, small
beads may be provided on the surface which may be released
upon completion of the synthesis.

9. Protective Group:

A material which is chemically bound to a monomer unit
and which may be removed upon selective exposure to an
activator such as electromagnetic radiation. Examples of
protective groups with utility herein include those
comprising nitropiperonyl, pyrenylmethoxy-carbonyl, nitroveratryl,
nitrobenzyl, dimethyl dimethoxybenzyl, 5-bromo-7-
nitroindolinyl, o-hydroxy-α-methyl cinnamoyl, and
2-oxymethylene anthraquinone.

10. Predefined Region:

A predefined region is a localized area on a surface which
is, was, or is intended to be activated for formation of a
polymer. The predefined region may have any convenient
shape, e.g., circular, rectangular, elliptical, wedge-shaped,
etc. For the sake of brevity herein, "predefined regions" are
sometimes referred to simply as "regions."

11. Substantially Pure:

A polymer is considered to be "substantially pure" within
a predefined region of a substrate when it exhibits
characteristics that distinguish it from other predefined regions.
Typically, purity will be measured in terms of biological
activity or function as a result of uniform sequence. Such
characteristics will typically be measured by way of binding
with a selected ligand or receptor.
65 12. Activator refers to an energy source adapted to render a 
group active and which is directed from a source to a 
predefined location on a substrate. A primary illustration of 
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an activator is light Other examples of activator* include ion Ate deprotectiotL a first of a set of bunding blocks 
beams electric fidds.magflctic fields, electron beams, x-ray, (indicated by "A" in FIG. 1). each beariog a photolabile 
and the like. protecting group (u>dicaied by ~X") is exposed to the surface 

13. Binary Synthesis Strategy refers to an ordered strategy of the substrate and it reacts with regions thai were 
for parallel synthesis of diverse polymer sequences by 5 addressed by light in the preceding step. The substrate is 
sequential addition of reagents which may be represented by then illuminated through a second mask 4b, which activates 
a reacunt matrix, and a switch matrix, the product of which another region for reaction with a second protected building 
is a product matrix. A reactant matrix is a Ixn matrix of the block "B". The pattern of masks used in these illuminations 
building blocks to be added. The elements of the switch Md ^e sequence of reactants define the ultimate products 
matrix are binary numbers. In preferred embodiments, a {Q ^ Iq^q^ resulting in diverse sequences at pre- 
binary strategy is one in which at least two successive steps dc ^ Bcd locations, as shown with the sequences ACEG and 
fllnrmnate half of a region of interest on the substrate. In BDFR k me lowcr Q f FIG. 1. Preferred emboc*- 

most preferred embodiments, binary synthesis refers to a mects of ^ ^ven^ ^ advantage of combinatorial 
synthesis strategy which also factors ^ masking strategies to form a large number of compounds in 

step. Foe example, a strategy « ^^il 15 a smaU number* chemical steps. 

masking strategy halves rt^ons tat A ^ * of ^^aTon is possible because the 

muminated, aiurninaring about half of the previously uiu- ^ . . . . , , JL.w, „„>u ^.^^ 

SSo3p^ting the remaining half (while also density of compounds is determined largely with regard to 
^L bn^oc^ly protected regions and spatial ad^s*^ of the activator, xn one case the *f- 
S^STlSut half c/^vicJryVotected regions). It fraction of light Each compound is physically accessible 
will be wosniaed that binary rounds may be interspersed 20 and its position is precisely known. Hence, the array is 
with non-bituoryrounds and that only a portion of a substrate spatially-addressable and its interactions with other mol- 
may be subjected to a binary scheme, but will still be ecules can be assessed. 

considered to be a binary pm<Vin £ scheme within the In a particular embodiment shown in FIG. 1, the substrate 
definidon herein- A binary "masking" strategy is a binary contains amino groups that are blocked with a photolabile 
synthesis which uses light to remove protective groups from 25 protecting group. Amino add sequences are made a cce ssible 
m^rWiit for addition of other nufrriah such as amino acids. fa coupling to a receptor by removal of the photoprotectfve 
In Referred embodiments, selected columns of the switch groups, ' 

matrix arc arranged in order of increasing binary numbers in When a polymer sequence to be synthesized is, for 
the columns of the switch matri x. example, a polypeptide, amino groups at the ends of linkers 

14. linker refers to a molecule or group of molecules K attachcd to a substrate are derivatized with nitrovcra- 
arrarhed to a substrate and spacing a synthesized polymer tryloxycarbonyl (NVOC), a pbc<cremovable protecting 
from the substrate for exposure/binding to a receptor. group. The HnVrr molecules may be. for example, aryl 

EL General acetylene, ethylene glycol oligomers containing from 2-10 

■Tk present invention provides synthetic strategies and mooom«. tEamines. diadds, mkoiodi or cQmbiwrioDS 
ocvte foTthe creation of Urge sole chemical diversity, 33 thereof. Fhotodeprotecnon is effected by illnrmn.n on of the 
Solid-phase chemistry, photolabile protecting groups, and substrate through, for example, a mask wherein the pattern 
photolimography are brought together to achieve light- has transparent regions with dunenMoas of. for example, 
directed spatially-addressable parallel chemical synthesis in less than 1 cm 1 , lCT* cm 3 . Iff" 3 cm 1 . 10" cm . 10 cm . 
P^fo^mboLnents. 1<T 5 em*. Iff* cm 1 , 1<T 7 cm 1 , 1CT* cm 3 , or Mr" cmMn 

The invention is described herein for purposes of fllus- « a preferred embodiment, the regions are between abort 
tration primarily with regard to the preparation of peptides 10x10 urn and 500x500 Jim. According to some 
and nucleotides, but could readily be applied in the prepa- embodiments, the masks are arranged to produce a check- 
ration of other polymers. Such polymers include, for aboard array of polymers, although any one of a variety of 
example, both linear and cyclic polymers of nucleic acids, geometric configurations may be utilized. 
Dotvsaccharides, phospholipids, and peptides having either 43 1. Example 

cT B- or ©-amino adds, heteropotymers in which a known In one example of me invention, free amino £ oa P*^« e 
drug is covalentry bound to «y1f the above, poryurtthanes. fluorescent* labelled by treatment of l*"*"**™ 
poWesters. polycarbonates, polyureas. polyamides. surface with fluorescein uotmocynate (FITO ate photo- 
r^yemylencimines. polyarylene sulfides, polysiloxaaes. deprotection. Glass microscope slides were cleaned, ami- 
EoMmiL polvacetaw. or other polymers whkh wfll be 30 nated by treatment with 0.1% anmiopropylrietio^silancin 
'w^n^HcToi mis dud«rit wfll be recog- 95% ethanol. and intubated at 110" C for 20 mu. The 
Sled furme7that fllustratioos herein are primarily with aminated surface of the slide was ;then exposed to f »30mM 
reference to C- to N-teamnal synthesis, but the invention soludbn of the N-hydroxywconinnde ester of NVOC- 
could readily be applied to N- to C-terminal synthesis GAB A (nitrweratrylorycarbonyl-T-amino butyric aci d) in 
withoutdcpaiting fiSnthc scope of the invention. 33 DMF. The NVOC protecting group was photolyticaUy 

A^xoSnand AddinoT^ rcnxTved by imaging the 365 nm output from «Hg arc Ump 

Thepresent invention uses a masked light source or other through a chrome on glass 100 urn dteckerboard nustonlo 
activator! direct the simultaneous synthesis of many dif- the substrate for 20 mm * » I™« of Rj^i^ 

fcrcnt chemical compounds. FIG. 1 is a flow chart fllnstrat- The exposed surface was then treated with 1 mM F1TC in 
ing the process of faming chemical compounds according to DMF. The substrate surface was samncd in epj- 
to one embodiment of the invention. Synthesis occurs on a fluorescence microscope (Zeiss Axioskop 20) usoi« 4»8irm 
soUd support 2. A pattern of illumination through a mask 4a excitation from an argon ton laser (Spectra-Riysics model 
usmgTK^^^tcnnines which retfon, cf the 2025). The fluorescence emission above iJOjm was 
support arT^ctivated for chemical coupling. In one preferred detected by a cooled photon^er (H™^u 943^ 
onSodiment activation is accomplished by using light to 45 opo^bapto^ii^Hw™!^ 
remove photolabile protecting groups from selected areas of was translated into a color display with red in 
feembstrate r intensity and black in the lowest intensity areas. The prcs- 
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ence of a hi go-contrast fluorescent chcckcboard pattern of mean intensity of sixteen YGGFL synthesis sites was 2.03 x 

100x100 um elements revealed that free amino groups were 10 s counts and the standard deviation was 9.6x10* counts, 

generated in specific regions by spadaUylocalized photo- jjl Synthesis 

deprotection. a. Reactor System . 

Z EXAMPLE * FIG. 4 schematically illustrates a device used to synthe- 

FIG. 2 is a flow chart illustrating another example of the diverse polymer sequences on a substrate. The 

invention. Carboxy -activated NVOC-leacine was allowed to substrate, the area of synthesis, and the area for synthesis of 

react with -an aminated substrate. The .carboxy activated . each individual polymer could be of any size or shape. For 

HOBT ester of leucine and other amino adds used in this example, squares, ellipsoids, rectangles, triangles, circles, or 

synthesis was formed by mixing 0.25 mmol of the NVOC to portions thereof, along with irregular geometric shapes may 

amino protected amino acid with 37 mg HOBT be utilized. Duplicate synthesis areas may also be applied to 

(1-hydroxybenzotriazole). Ill mg BOP (benzocdazolyl-n- a single substrate for purposes of redundancy, 

oxy-tris (dimethylamino)-phosphoniumhexa- In one embodiment, the predefined regions on the sub- 



VA^ ••<« \ — y - — — — y f — £ ■» — » 

fluorophosphate) and So ul DIEA(cUisoprcpylemylamine) in strate win have a surface area of between about 1 cm and 
2.5 ml DMF. The NVOC protecting group was removed by 15 10~ lo cm 3 . In some embodiments the regions have areas of 
uniform illumination. Carboxy-activated NVOC- less than about 1CT 1 cm*, 10 -2 em 3 . lCT'cm 2 . 10^ cm\ 10" 5 
phenylalanine was coupled to the exposed amino groups for cm 1 . 10 -6 cm 1 . l(T 7 cm a , 1CT* cm 2 , 10 cm" or 10" cm". 
2 hours at room temperature, and then washed with DMF In a preferred embodiment, the regions are" between about 
and methylene chloride. Two unmasked cycles of photo- 10x10 um. 

deprotection and coupling with carboxy-activated NVOC- 20 in some embodiments a single substrate supports more 
glycine were carried out. The surface was then fflurainated than aboot 10 different monomer sequences and perferably 
through a chrome on glass 50 ul checkerboard pattern mask. more than about 100 different monomer sequences, alAough 
Carboxy-activared Na*tBOC-0-tBuryl-L-tyrosine was then in some embodiments more than aboot l(r, 10*, 10 3 , lCr, 
k aa*a The entire surface was uniformly illuminated to 10 7 , or 10* different sequences are provided on a substrate, 
photolyze the remaining NVOC groups. Finally, carboxy- 25 Of course, within a region of the substrate in which a 
activated NVOC-L-proline was added, the NVOC group monomer sequence ii sy a t hc ^' T fd, it is preferred that the 
was removed by illumination, and the t-BOC and t-butyl monomer sequence be substantially pure. In some 
protecting groups were removed with TFA. After removal of embodiments, regions of the substrate contain polymer 
the protecting groups, the surface consisted of a 50 urn sequences which are at least about 1%, 5*. 10%, 15%. 20*. 
checkerboard array of Tyr-Gly-Gly-Phe-Leu (YGGFL) 30 25%, 30%, 35%, 40%, 45% ? 50%, 60%, 70%, 80%, 90%, 
(Seq. ID No:l) and Pro-Gly-Gly-Fhe-Leu (PGGFLXSeq. ID 95%, 96%. 97%. 98%. or 99* pure. The device includes an 
j^ 01 2j ; automated peptide synthesizer 40 L The automated peptide 

B. Antibody Recognition synthesizer is a device which flows selected reagents 

In one preferred embodiment the substrate is used to through a flow cell 4*2 under the direction of a computer 
determine which of a plurality of amino acid sequences is 35 404. In a preferred embodiment the synthesizer is an ABI 
recognized by an antibody of interest. Peptide Synthesizer, model do. 43 1A. The computer may be 

1. EXAMPLE selected from a wide variety of computers or discrete logic 

In one example, the array of pentapeptides in the example including for, example, an IBM PC-XT or similar computer 
illustrated in FIG. 2 probed with a mouse monoclonal linked with appropriate internal control systems in the 
antibody directed against p^ndcspbmThis antibody (called 40 peptide synthesizer. The PC is provided with signals from 
3E7) is known to bind YGGFL and YGGFM (Seq. ID the board computer indicative of, for example, the end of a 
No:21) with nanomolar affinit y and is discussed in Moo et coupling cycle. 

aL, Prvc. NatL Acad. Sci. USA (1983) 80:4084, which is Substrate 4*6 is mounted on the flow cell, fecming a 
mccrporated by reference herein for all purposes. This cavity between the substrate and the flow cdl Selected 
antibody requires the amino terminal tyrosine for high 45 reagents flow through this cavity from the peptide synthe- 
affinity binding. The array of peptides formed as described sizer at selected times, forming an array of peptides on the 
in FIG. 2 was incubated with a 2 ug/ml mouse monoclonal face of the substrate in the cavity. Mounted above the 
antibody (3E7) known to recognize YGGFL- 3E7 does not substrate, and preferably in contact with the substrate is a 
bind PGGFL. A second incubation with fluorescein arrd goat mask 408. Mask 448 is transparent in sel ect ed regions to a 
anu^mousc antibody labeled the regions that bound 3E7.Tbe 30 selected wavelength of light and is opaque in other region 
surface was scanned with an cpi-fluorescence microscope. to the selected wavelength of light. The mask is illuminated 
The results showed alternating bright and dark 50 urn with a light source 4l# such as a UV light source. In one 
squares indicating that YGGFL and PGGFL were synthe- specific embodiment the light source 410" is a model no. 
sized in geometric array determined by the mask. A high 82420 made by OrieL The mask is held and translated by an 
contrast (>12:1 intensity ratio) fluorescence checkerboard 55 x-y-z translation stage 412 such as an x-y translation stage 
image shows that (a) YGGFL and PGGFL were synthesized made by Newport Corp. The computer coordinates action of 
in alternate 50 urn squares, (b) YGGFL attached to the the peptide synthesizer, x-y translation stage, and light 
surface is accessible for binding to antibody 3E7. and (c) source. Of course, the invention may be used in some 
antibody 3E7 docs not bind to PGGFL- emrx>dimcnts with translation of the substrate instead of the 

A three-dimensional representation of the fluorescence 60 mask, 
intensity data in a portion of the checkboard is shown in FIG. In operation, the substrate is mounted on the reactor 
3 This figure shows that the border between synthesis sites cavity. The slide, with its surface protected by a suitable 
is sharp. The height of each spike in this display is linearly • photo removable protective group, is exposed to fagnt at 
proportional to the integrated fluorescence intensity in a Z5 selected locations by positioning the mask and fflnrmn a tin g 
urn pixel The transition between PGGFL and YGGFL 65 the light source for a desired period of time (such as, for 
occurs within two spikes (5 urn). There is little variation in example, 1 sec to 60 min in the case of peptide syndesis), 
the fluorescence intensity of -different YGGFL squares. The A selected peptide or other n»nc*ner7polyraer is pumped 
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through the reactor cavity by the peptide synthesizer for in a synthesis region. A substrate formed with mixtures of 

KrvKrg at the selected locations on the substrate. After a compounds in various synthesis regions may be used to 

selected reaction time (such as about 1 sec to 300 min in the perform, for example, an initial screening of a Urge number 

case of peptide reactions) of the monomer is wished from compounds, after which a smaller number of compounds 

the system, the mask is appropriately repositioned cr 3 » regions which exhibit high binding affinity are further 

replaced, and me cycle is repeated. In most embodiments of Similar results may be obtained by only partially 

thTinvenrion. reactions tr*7b^<*cted at or near ambient pbotylrzmg a region, adding a first monomer, rc-photylizuig 

i7^,^l,Atiiff- . the same region, and exposing the region to a second 

FIGS. Sa and Sb axe flow charts of the software used in Synthcsi$ Strategy 

operation of the reactor system. At step 502 the peptide W ba ^directed chemical synthesis, the products 

synthesis software is initialized. At step 504 the system formed depend on the pattern and order of masks, and on the 

calibrates positioners on the x-y translation stage and begins rf reacmts> To a set of products there will in 

a main loop. At step 5*6 the system detennines which, if ge0C ral be "n* possible masking schemes. In preferred 



any. of the function keys oo the computrr have been pressed. cniodimeiits of the invention herein a binary synthesis 
If Fl has been pressed, the system prompts the user for input 15 j^egy-is utilized The binary synthesis strategy is Uhis- 
of a desired synthesis process. If the user enters F2. the . xnted herein primarily with regard to a masking strategy, 
system allows a user to edit a file for a synthesis process at although it will be applicable to other polymer synthesis 
step 510. If the user enters F3 the system loads a process strategies such as the pin strategy, and the like, 
from a disk at step 512. If the user enters F4 the system saves ^ a binary synthesis strategy, the substrate is irradiated 
an entered or edited process to disk at step 514. If the user 20 with a first m»%V_ exposed to a first building block, irradiated 
selects F5 the current process is displayed at step 514 while with * second m«ir exposed to a second building block, etc 
selection of F6 starts the main partioo of the program, Le.. combination of masked irradiation and exposure to a 

me actual synthesis according to the selected process. If the building block is referred to herein as a "cycle." 
user selects F7 the system displays the location of the In a preferred binary masking scheme, the masks for each 
synthesized peptides, while pressing Fll returns the user to 25 cy^c allow irradiation of half of a region of interest on the 
the disk operating system. ^ substrate and rxotectiofi of the remaining half of the region 

FIG. Sb illustrates the synthesis step 518 in greater detail, of interest By TulT it is intended herein not to mean 
The main loop of the program is started In which the system exactly one-half the region of interest, but fn*t^H a large 
first moves the mask to a next position at step 516. During fraction of the region of interest such as from about 30 to 70 
the main loop of the program, necessary chemicals flow 30 percent of the region of interest It will be understood that 
through the reaction cell under the direction of the oa-board ^ entire moving *rhm* need not take a binary form; 
computer in the peptide synthesizer. At step 528 the system then waits for an exposure command and, upon receipt of the exposure command, exposes the substrate for a desired time at step 530. When an acknowledgment of exposure complete is received at step 532, the system determines if the process is complete at step 534 and, if so, waits for additional keyboard input at step 536 and, thereafter, exits the perform synthesis process.

A computer program used for operation of the system described above is included as microfiche Appendix A (Copyright, 1990, Affymax Technologies N.V., all rights reserved). The program is written in Turbo C++ (Borland Int'l) and has been implemented in an IBM compatible system. The motor control software is adapted from software produced by Newport Corporation. It will be recognized that a large variety of programming languages could be utilized without departing from the scope of the invention herein. Certain calls are made to a graphics program in "Programmer Guide to PC and PS2 Video Systems" (Wilton, Microsoft Press, 1987), which is incorporated herein by reference for all purposes.

Alignment of the mask is achieved by one of two methods in preferred embodiments. In a first embodiment the system relies upon relative alignment of the various components, which is normally acceptable since x-y-z translation stages are capable of sufficient accuracy for the purposes herein. In alternative embodiments, alignment marks on the substrate are coupled to a CCD device for appropriate alignment.

According to some embodiments, pure reagents are not added at each step, or complete photolysis of the protective groups is not provided at each step. According to these embodiments, multiple products will be formed in each synthesis site. For example, if the monomers A and B are mixed during a synthesis step, A and B will bind to deprotected regions, roughly in proportion to their concentration in solution. Hence, a mixture of compounds will be formed

instead non-binary cycles may be introduced as desired between binary cycles.

In preferred embodiments of the binary masking scheme, a given cycle illuminates only about half of the region which was illuminated in a previous cycle, while protecting the remaining half of the illuminated portion from the previous cycle. Conversely, in such preferred embodiments, a given cycle illuminates half of the region which was protected in the previous cycle and protects half the region which was protected in a previous cycle.

The synthesis strategy is most readily illustrated and handled in matrix notation. At each synthesis site, the determination of whether to add a given monomer is a binary process. Therefore, each product element $P_j$ is given by the dot product of two vectors, a chemical reactant vector, e.g., $C = [A, B, C, D]$, and a binary vector $\sigma_j$. Inspection of the products in the example below for a four-step synthesis shows that in one four-step synthesis $\sigma_1 = [1,0,1,0]$, $\sigma_2 = [1,0,0,1]$, $\sigma_3 = [0,1,1,0]$, and $\sigma_4 = [0,1,0,1]$, where a 1 indicates illumination and a 0 indicates protection. Therefore, it becomes possible to build a "switch matrix" S from the column vectors $\sigma_j$ ($j = 1, \ldots, k$, where k is the number of products).

$$
S = \begin{array}{cccc}
\sigma_1 & \sigma_2 & \sigma_3 & \sigma_4 \\
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1
\end{array}
$$

The outcome P of a synthesis is simply $P = CS$, the product of the chemical reactant matrix and the switch matrix.

The switch matrix for an n-cycle synthesis yielding k products has n rows and k columns. An important attribute of S is that each row specifies a mask. A two-dimensional mask $m_j$ for the jth chemical step of a synthesis is obtained directly from the jth row of S by placing the elements $s_{j1}, \ldots, s_{jk}$ into, for example, a square format. The particular arrangement below provides a square format, although linear or other arrangements may be utilized.



[Square array of switch-matrix elements; the individual entries are not legible in the source.]



Of course, compounds formed in a light-activated synthesis can be positioned in any defined geometric array. A square or rectangular matrix is convenient but not required. The rows of the switch matrix may be transformed into any convenient array as long as equivalent transformations are used for each row.

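For example, the masks in the four-step synthesis below are then denoted by:

$$
m_1 = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}, \quad
m_2 = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix}, \quad
m_3 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad
m_4 = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}
$$

where 1 denotes illumination (activation) and 0 denotes no illumination.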

The matrix representation is used to generate a desired set of products and product maps in preferred embodiments. Each compound is defined by the product of the chemical vector and a particular switch vector. Therefore, for each synthesis address, one simply saves the switch vector, assembles all of them into a switch matrix, and extracts each of the rows to form the masks.
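The bookkeeping described here can be sketched in a few lines of Python. This is only an illustration, not the program of Appendix A: the function names are hypothetical, and the four switch vectors are taken as [1,0,1,0], [1,0,0,1], [0,1,1,0], and [0,1,0,1], as read from the four-step example.

```python
# Reactant vector C and the switch vectors of the four-step example (assumed values).
C = ["A", "B", "C", "D"]
switch_vectors = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

def build_switch_matrix(vectors):
    """Assemble switch vectors as columns of S (n rows = steps, k columns = products)."""
    n = len(vectors[0])
    return [[v[j] for v in vectors] for j in range(n)]

def products(reactants, S):
    """P = CS: for each column, join in order every reactant whose entry is 1."""
    k = len(S[0])
    return ["".join(r for r, row in zip(reactants, S) if row[col]) for col in range(k)]

def masks(S, width):
    """Reshape each row of S into a two-dimensional mask with `width` sites per line."""
    return [[row[i:i + width] for i in range(0, len(row), width)] for row in S]

S = build_switch_matrix(switch_vectors)
P = products(C, S)
M = masks(S, 2)
print(P)     # products, one per synthesis site
print(M[0])  # mask for the first chemical step
```

The same reshaping (here a 2x2 square) must be applied to every row so that site positions stay consistent across masks.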

In some cases, particular product distributions or a maximal number of products are desired. For example, for $C = [A, B, C, D]$, any switch vector ($\sigma_j$) consists of four bits. Sixteen four-bit vectors exist. Hence, a maximum of 16 different products can be made by sequential addition of the reagents [A, B, C, D]. These 16 column vectors can be assembled in 16! different ways to form a switch matrix. The order of the column vectors defines the masking patterns, and, therefore, the spatial ordering of products but not their make-up. One ordering of these columns gives the following switch matrix (in which "null" (∅) additions are included in brackets for the sake of completeness, although such null additions are elsewhere ignored herein):
locations on the substrate are simply defined by the columns of the switch matrix (the first column indicating, for example, that the product ABCD will be present in the upper left-hand location of the substrate). Furthermore, if only selected desired products are to be made, the mask sequence can be derived by extracting the columns with the desired sequences. For example, to form the product set ABCD, ABD, ACD, AD, BCD, BD, CD, and D, the masks are formed by use of a switch matrix with only the 1st, 3rd, 5th, 7th, 9th, 11th, 13th, and 15th columns arranged into the switch matrix:



$$
S = \begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
$$
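As a hedged illustration of the column-extraction step, the following Python sketch builds the four-bit binary switch matrix (columns counting down from 15 to 0) and selects the odd-numbered columns. The function names and the string "null" for the empty product are assumptions made for this example.

```python
def binary_switch_matrix(n):
    """Return S with n rows whose columns are the n-bit numbers 2**n - 1 down to 0."""
    cols = [[(value >> (n - 1 - bit)) & 1 for bit in range(n)]
            for value in range(2 ** n - 1, -1, -1)]
    return [[c[row] for c in cols] for row in range(n)]

def select_columns(S, one_based_columns):
    """Keep only the listed columns (numbered from 1, as in the text)."""
    return [[row[c - 1] for c in one_based_columns] for row in S]

def products(reactants, S):
    """Join, in order, the reactants switched on in each column; 'null' if none."""
    k = len(S[0])
    return ["".join(r for r, row in zip(reactants, S) if row[col]) or "null"
            for col in range(k)]

S16 = binary_switch_matrix(4)
all_products = products("ABCD", S16)

S8 = select_columns(S16, [1, 3, 5, 7, 9, 11, 13, 15])
subset = products("ABCD", S8)
print(all_products)
print(subset)
```

Each row of the reduced matrix is then the mask for one reactant, exactly as for the full matrix.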



To form all of the polymers of length 4, the reactant matrix [ABCD ABCD ABCD ABCD] is used. The switch matrix will be formed from a matrix of the binary numbers from 0 to $2^{16}$ arranged in columns. The columns having four monomers are then selected and arranged into a switch matrix. Therefore, it is seen that the binary switch matrix in general will provide a representation of all the products which can be made from an n-step synthesis, from which the desired products are then extracted.

The rows of the binary switch matrix will, in preferred embodiments, have the property that each masking step illuminates half of the synthesis area. Each masking step also factors the preceding masking step; that is, half of the region that was illuminated in the preceding step is again illuminated, whereas the other half is not. Half of the region that was unilluminated in the preceding step is also illuminated, whereas the other half is not. Thus, masking is recursive. The masks are constructed, as described previously, by extracting the elements of each row and placing them in a square array. For example, the four masks in S for a four-step synthesis are:
The recursive factoring of masks allows the products of a light-directed synthesis to be represented by a polynomial. (Some light activated syntheses can only be denoted by irreducible, i.e., prime polynomials.) For example, the polynomial corresponding to the top synthesis of FIG. 9a (discussed below) is

$$P = (A + B)(C + D)$$

A reaction polynomial may be expanded as though it were an algebraic expression, provided that the order of joining of reactants $X_1$ and $X_2$ is preserved ($X_1X_2 \neq X_2X_1$), i.e., the products are not commutative. The product then is AC+AD+BC+BD. The polynomial explicitly specifies the reactants and implicitly specifies the mask for each step. Each pair of parentheses demarcates a round of synthesis. The chemical reactants of a round (e.g., A and B) react at nonoverlapping sites and hence cannot combine with one another. The synthesis area is divided equally amongst the elements of a round (e.g., A is directed to one-half of the area and B to the other half). Hence, the masks for a round (e.g., the masks $m_A$ and $m_B$) are orthogonal and form an orthonormal set. The polynomial notation also signifies that each element in a round is to be joined to each element of the next round (e.g., A with C, A with D, B with C, and B with D). This is accomplished by having $m_C$ overlap $m_A$ and $m_B$ equally, and likewise for $m_D$. Because C and D are elements of a round, $m_C$ and $m_D$ are orthogonal to each other and form an orthonormal set.

The columns of S according to this aspect of the invention are the binary representations of the numbers 15 to 0. The sixteen products of this binary synthesis are ABCD, ABC, ABD, AB, ACD, AC, AD, A, BCD, BC, BD, B, CD, C, D, and ∅ (null). Also note that each of the switch vectors from the four-step synthesis masks above (and hence the synthesis products) are present in the four bit binary switch matrix. (See columns 6, 7, 10, and 11.)

This synthesis procedure provides an easy way for mapping the completed products. The products in the various

The polynomial representation of the binary synthesis described above, in which 16 products are made from 4 reactants, is

$$P = (A + \varnothing)(B + \varnothing)(C + \varnothing)(D + \varnothing)$$

which gives ABCD, ABC, ABD, AB, ACD, AC, AD, A, BCD, BC, BD, B, CD, C, D, and ∅ when expanded (with the rule that ∅X=X and X∅=X and remembering that joining is ordered). In a binary synthesis, each round contains one reactant and one null (denoted by ∅). Half of the synthesis area receives the reactant and the other half receives nothing. Each mask overlaps every other mask equally.

Binary rounds and non-binary rounds can be interspersed as desired, as in

$$P = (A + \varnothing)(B)(C + D + \varnothing)(E + F + G)$$

The 18 compounds formed are ABCE, ABCF, ABCG, ABDE, ABDF, ABDG, ABE, ABF, ABG, BCE, BCF, BCG, BDE, BDF, BDG, BE, BF, and BG. The switch matrix S for this 7-step synthesis is

$$
S = \begin{bmatrix}
1&1&1&1&1&1&1&1&1&0&0&0&0&0&0&0&0&0 \\
1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1 \\
1&1&1&0&0&0&0&0&0&1&1&1&0&0&0&0&0&0 \\
0&0&0&1&1&1&0&0&0&0&0&0&1&1&1&0&0&0 \\
1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0 \\
0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0 \\
0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1
\end{bmatrix}
$$

The round denoted by (B) places B in all products because the reaction area was uniformly activated (the mask for B consisted entirely of 1's).
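The expansion of a reaction polynomial and the corresponding switch matrix can be sketched as follows. This is a minimal illustration under stated assumptions: rounds are written as Python lists, `None` stands for the null addition, the string "null" names the empty product, and the function names are hypothetical.

```python
from itertools import product

def expand(rounds):
    """Expand an ordered list of rounds; joining is ordered and nulls are dropped."""
    return ["".join(u for u in combo if u is not None) or "null"
            for combo in product(*rounds)]

def switch_matrix(rounds):
    """Return (reactant, row) pairs; a row has a 1 wherever that reactant is added."""
    combos = list(product(*rounds))
    rows = []
    for i, rnd in enumerate(rounds):
        for unit in rnd:
            if unit is not None:
                rows.append((unit, [1 if combo[i] == unit else 0 for combo in combos]))
    return rows

# (A + null)(B)(C + D + null)(E + F + G)
rounds = [["A", None], ["B"], ["C", "D", None], ["E", "F", "G"]]
compounds = expand(rounds)
rows = switch_matrix(rounds)

for name, row in rows:
    print(name, "".join(str(bit) for bit in row))
```

Because `itertools.product` varies the last round fastest, the columns come out in the same order as the compound list given in the text.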

The number of compounds k formed in a synthesis consisting of r rounds, in which the ith round has $b_i$ chemical reactants and $z_i$ nulls, is

$$k = \prod_{i=1}^{r} (b_i + z_i)$$

and the number of chemical steps n is

$$n = \sum_{i=1}^{r} b_i$$

The number of compounds synthesized when b=a and z=0 in all rounds is $a^{n/a}$, compared with $2^n$ for a binary synthesis. For n=20 and a=5, 625 compounds (all tetramers) would be formed, compared with $1.049 \times 10^6$ compounds in a binary synthesis with the same number of chemical steps.
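The two counts can be checked numerically. In this short sketch each round is represented, as an assumption of the example, by a pair (b, z) giving its number of reactants and nulls; the function names are hypothetical.

```python
from math import prod

def count_compounds(rounds):
    """k: product over rounds of (reactants + nulls)."""
    return prod(b + z for b, z in rounds)

def count_steps(rounds):
    """n: total number of chemical reactants, one step each."""
    return sum(b for b, _ in rounds)

mixed = [(1, 1), (1, 0), (2, 1), (3, 0)]   # (A+null)(B)(C+D+null)(E+F+G)
binary_20 = [(1, 1)] * 20                  # twenty binary rounds
five_per_round = [(5, 0)] * 4              # a = 5 reactants, no nulls, n = 20

print(count_compounds(mixed), count_steps(mixed))
print(count_compounds(binary_20), count_steps(binary_20))
print(count_compounds(five_per_round), count_steps(five_per_round))
```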

It should also be noted that rounds in a polynomial can be nested, as in

$$(A + (B + \varnothing)(C + \varnothing))(D + \varnothing)$$

The products are AD, BCD, BD, CD, D, A, BC, B, C, and ∅.

Binary syntheses are attractive for two reasons. First, they generate the maximal number of products ($2^n$) for a given number of chemical steps (n). For four reactants, 16 compounds are formed in the binary synthesis, whereas only 4 are made when each round has two reactants. A 10-step binary synthesis yields 1,024 compounds, and a 20-step synthesis yields 1,048,576. Second, products formed in a binary synthesis are a complete nested set with lengths ranging from 0 to n. All compounds that can be formed by deleting one or more units from the longest product (the n-mer) are present. Contained within the binary set are the smaller sets that would be formed from the same reactants using any other set of masks (e.g., AC, AD, BC, and BD formed in the synthesis shown in FIG. 6 are present in the set of 16 formed by the binary synthesis). In some cases, however, the experimentally achievable spatial resolution may not suffice to accommodate all the compounds formed. Therefore, practical limitations may require one to select a particular subset of the possible switch vectors for a given synthesis.

1. EXAMPLE 

FIG. 6 illustrates a synthesis with binary masking scheme. The binary masking scheme provides the greatest number of sequences for a given number of cycles. According to this embodiment, a mask m1 allows illumination of half of the substrate. The substrate is then exposed to the building block A, which binds at the illuminated regions.

Thereafter, the mask m2 allows illumination of half of the previously illuminated region, while protecting half of the previously illuminated region. The building block B is then added, which binds at the illuminated regions from m2.

The process continues with masks m3, m4, and m5, resulting in the product array shown in the bottom portion of the figure. The process generates 32 (2 raised to the power of the number of monomers) sequences with 5 (the number of monomers) cycles.

2. EXAMPLE 

FIG. 7 illustrates another preferred binary masking scheme which is referred to herein as the gray code masking scheme. According to this embodiment, the masks m1 to m5 are selected such that a side of any given synthesis region is defined by the edge of only one mask. The site at which the sequence BCDE is formed, for example, has its right edge defined by m5 and its left side formed by mask m4 (and no other mask is aligned on the sides of this site). Accordingly, problems created by misalignment, diffusion of light under the mask and the like will be minimized.
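The single-edge property can be illustrated with a standard binary-reflected Gray code. The sketch below is not the mask set of FIG. 7, which is not reproduced here; it assumes, purely for illustration, that the 32 synthesis sites lie along one line and that each site is labelled by a 5-bit code whose bits say which of the five masks illuminate it.

```python
def gray_code(n_bits):
    """Binary-reflected Gray code: successive values differ in exactly one bit."""
    return [i ^ (i >> 1) for i in range(2 ** n_bits)]

def masks_from_codes(codes, n_bits):
    """Row j is the mask for step j: 1 where bit j of the site's code is set."""
    return [[(code >> (n_bits - 1 - j)) & 1 for code in codes] for j in range(n_bits)]

def edges_per_boundary(masks):
    """For each boundary between adjacent sites, count the masks with an edge there."""
    width = len(masks[0])
    return [sum(m[i] != m[i + 1] for m in masks) for i in range(width - 1)]

n = 5
gray_masks = masks_from_codes(gray_code(n), n)
plain_masks = masks_from_codes(list(range(2 ** n)), n)

gray_edges = edges_per_boundary(gray_masks)
plain_edges = edges_per_boundary(plain_masks)
print(max(gray_edges), max(plain_edges))
```

With plain binary ordering, some boundaries coincide with the edges of several masks, so a misalignment in any of them disturbs the same site; with Gray-code ordering every boundary belongs to a single mask.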

3. EXAMPLE

FIG. 8 illustrates another binary masking scheme. According to this scheme, referred to herein as a modified gray code masking scheme, the number of masks needed is minimized. For example, the mask m2 could be the same mask as m1 and simply translated laterally. Similarly, the mask m4 could be the same as mask m3 and simply translated laterally.

4. EXAMPLE

A four-step synthesis is shown in FIG. 9a. The reactants are the ordered set {A, B, C, D}. In the first cycle, illumination through m1 activates the upper half of the synthesis area. Building block A is then added to give the distribution 602. Illumination through mask m2 (which activates the lower half), followed by addition of B, yields the next intermediate distribution 604. C is added after illumination through m3 (which activates the left half) giving the distribution 604, and D after illumination through m4 (which activates the right half), to yield the final product pattern 608 (AC, AD, BC, BD).

5. EXAMPLE

The above masking strategy for the synthesis may be extended for all 400 dipeptides from the 20 naturally occurring amino acids as shown in FIG. 9b. The synthesis consists of two rounds, with 20 photolysis and chemical coupling cycles per round. In the first cycle of round 1, mask 1 activates 1/20th of the substrate for coupling with the first of 20 amino acids. Nineteen subsequent illumination/coupling cycles in round 1 yield a substrate consisting of 20 rectangular stripes each bearing a distinct member of the 20 amino acids. The masks of round 2 are perpendicular to round 1 masks and therefore a single illumination/coupling cycle in round 2 yields 20 dipeptides. The 20 illumination/coupling cycles of round 2 complete the synthesis of the 400 dipeptides.

6. EXAMPLE

The power of the binary masking strategy can be appreciated by the outcome of a 10-step synthesis that produced 1,024 peptides. The polynomial expression for this 10-step binary synthesis was:

[polynomial expression not legible in the source]

Each peptide occupied a 400×400 μm square. A 32×32 peptide array (1,024 peptides, including the null peptide and 10 peptides of l=1, and a limited number of duplicates) was clearly evident in a fluorescence scan following side group deprotection and treatment with the antibody 3E7 and fluoresceinated antibody. Each synthesis site was a 400×400 μm square.

The scan showed a range of fluorescence intensities, from a background value of 3,300 counts to 22,400 counts in the brightest square (x=20, y=9). Only 15 compounds exhibited an intensity greater than 12,300 counts. The median value of the array was 4,800 counts.

The identity of each peptide in the array could be determined from its x and y coordinates (each range from 0 to 31) and the map of FIG. 10. The chemical units at positions 2, 5, 6, 9, and 10 are specified by the y coordinate and those at positions 1, 3, 4, 7, 8 by the x coordinate. All but one of the peptides was shorter than 10 residues. For example, the peptide at x=12 and y=3 is YGAGF (SEQ. ID No:3) (positions 1, 6, 8, 9, and 10 are nulls). YGAFLS (SEQ. ID No:4), the brightest element of the array, is at x=20 and y=9.

It is often desirable to deduce a binding affinity of a given peptide from the measured fluorescence intensity. Conceptually, the simplest case is one in which a single peptide binds to a univalent antibody molecule. The fluorescence scan is carried out after the slide is washed with buffer for a defined time. The order of fluorescence intensities is then a measure primarily of the relative dissociation rates of the antibody-peptide complexes. If the on-rate constants are the same (e.g., if they are diffusion-controlled), the order of fluorescence intensities will correspond to the order of binding affinities. However, the situation is sometimes more complex because a bivalent primary antibody and a bivalent secondary antibody are used. The density of peptides in a synthesis area corresponded to a mean separation of ~7 nm, which would allow multivalent antibody-peptide interactions. Hence, fluorescence intensities obtained according to the method herein will often be a qualitative indicator of binding affinity.

Another important consideration is the fidelity of synthesis. Deletions are produced by incomplete photodeprotection or incomplete coupling. The coupling yield per cycle in these experiments is typically between 85% and 93%.

Implementing the switch matrix by masking is imperfect because of light diffraction, internal reflection, and scattering. Consequently, stowaways (chemical units that should not be on board) arise by unintended illumination of regions that should be dark. A binary synthesis array contains many of the controls needed to assess the fidelity of a synthesis. For example, the fluorescence signal from a synthesis area nominally containing a tetrapeptide ABCD could come from a tripeptide deletion impurity such as ACD. Such an artifact would be ruled out by the finding that the fluorescence intensity of the ACD site is less than that of the ABCD site.

The fifteen most highly labelled peptides in the array obtained with the synthesis of 1,024 peptides described above were YGAFLS (SEQ. ID No: [illegible]), YGAFS (SEQ. ID No:6), YGAFL (SEQ. ID No:7), YGGFLS (SEQ. ID No:8), YGAF (SEQ. ID No:8), YGALS (SEQ. ID No:9), YGGFS (SEQ. ID No:10), YGAL (SEQ. ID No:11), YGAFLF (SEQ. ID No:12), YGAF (SEQ. ID No:13), YGAFF (SEQ. ID No:14), YGGLS (SEQ. ID No:15), YGGFL (SEQ. ID No:16), [illegible] (SEQ. ID No:17), and YGAFLSF (SEQ. ID No: [illegible]). [...] fifteen begin with YG, which agrees with previous work showing that an amino-terminal tyrosine is a key determinant of [...] either F or L. The exclusion of S and T from these positions is clear cut. The finding that the preferred sequence is YG(A/G)(F/L) fits nicely with the outcome of a study in which a very large library of peptides on phage generated by recombinant DNA methods was screened for binding to antibody 3E7 (see Cwirla et al., Proc. Natl. Acad. Sci. USA, (1990) 87:6378, incorporated herein by reference). Additional binary syntheses based on leads from peptides on phage experiments show that YGAFMQ (SEQ. ID No:18), YGAFM (SEQ. ID No:19), and YGAPQ (SEQ. ID No:20) give stronger fluorescence signals than does YGGFM, the immunogen used to obtain antibody 3E7.

Variations on the above masking strategy will be valuable in certain circumstances. For example, if a "kernel" sequence of interest consists of PQR separated from XYZ and the aim is to synthesize peptides in which these units are separated by a variable number of different residues, then the kernel can be placed in each peptide by using a mask that has 1's everywhere. The polynomial representation of a suitable synthesis is:

$$(P)(Q)(R)(A + \varnothing)(B + \varnothing)(C + \varnothing)(D + \varnothing)(X)(Y)(Z)$$

Sixteen peptides will be formed, ranging in length from the 6-mer PQRXYZ to the 10-mer PQRABCDXYZ.

Several other masking strategies will also find value in selected circumstances. By using a particular mask more than once, two or more reactants will appear in the same set of products. For example, suppose that the mask for an 8-step synthesis is

A 11110000
B 00001111
C 11001100
D 00110011
E 10101010
F 01010101
G 11110000
H 00001111

The products are ACEG, ACFG, ADEG, ADFG, BCEH, BCFH, BDEH, and BDFH. A and G always appear together because their additions were directed by the same mask, and likewise for B and H.

C. Linker Selection

According to preferred embodiments the linker molecules used as an intermediary between the synthesized polymers and the substrate are selected for optimum length and/or type for improved binding interaction with a receptor. According to this aspect of the invention diverse linkers of
varying length and/or type are synthesized for subsequent attachment of a ligand. Through variations in the length and type of linker, it becomes possible to optimize the binding interaction between an immobilized ligand and its receptor.

The degree of binding between a ligand (peptide, inhibitor, hapten, drug, etc.) and its receptor (enzyme, antibody, etc.) when one of the partners is immobilized on to a substrate will in some embodiments depend on the accessibility of the receptor in solution to the immobilized ligand. The accessibility in turn will depend on the length and/or type of linker molecule employed to immobilize one of the partners. Preferred embodiments of the invention therefore employ the VLSIPS™ technology described herein to generate an array of, preferably, inactive or inert linkers of varying length and/or type, using photochemical protecting groups to selectively expose different regions of the substrate and to build upon chemically-active groups.

In the simplest embodiment of this concept, the same unit is attached to the substrate in varying multiples or lengths in known locations on the substrate via VLSIPS™ techniques to generate an array of polymers of varying length. A single ligand (peptide, drug, hapten, etc.) is attached to each of them, and an assay is performed with the binding site to evaluate the degree of binding with a receptor that is known to bind to the ligand. In cases where the linker length impacts the ability of the receptor to bind to the ligand, varying levels of binding will be observed. In general, the linker which provides the highest binding will then be used to assay other ligands synthesized in accordance with the techniques herein.

According to other embodiments the binding between a single ligand/receptor pair is evaluated for linkers of diverse monomer sequence. According to these embodiments, the linkers are synthesized in an array in accordance with the techniques herein and have different monomer sequence (and, optionally, different lengths). Thereafter, all of the linker molecules are provided with a ligand known to have at least some binding affinity for a given receptor. The given receptor is then exposed to the ligand and binding affinity is deduced. Linker molecules which provide adequate binding between the ligand and receptor are then utilized in screening studies.

D. Protecting Groups

As discussed above, selectively removable protecting groups allow creation of well defined areas of substrate surface having differing reactivities. Preferably, the protecting groups are selectively removed from the surface by applying a specific activator, such as electromagnetic radiation of a specific wavelength and intensity. More preferably, the specific activator exposes selected areas of surface to remove the protecting groups in the exposed areas.

Protecting groups of the present invention are used in conjunction with solid phase oligomer syntheses, such as peptide syntheses using natural or unnatural amino acids, nucleotide syntheses using deoxyribonucleic and ribonucleic acids, oligosaccharide syntheses, and the like. In addition to protecting the substrate surface from unwanted reaction, the protecting groups block a reactive end of the monomer to prevent self-polymerization. For instance, attachment of a protecting group to the amino terminus of an activated amino acid, such as an N-hydroxysuccinimide-activated ester of the amino acid, prevents the amino terminus of one monomer from reacting with the activated ester portion of another during peptide synthesis. Alternatively, the protecting group may be attached to the carboxyl group of an amino acid to prevent reaction at this site. Most protecting groups can be attached to either the amino or the carboxyl group of an amino acid, and the nature of the chemical synthesis will dictate which reactive group will require a protecting group. Analogously, attachment of a protecting group to the 5'-hydroxyl group of a nucleoside during synthesis using, for example, phosphate-triester coupling chemistry, prevents the 5'-hydroxyl of one nucleoside from reacting with the 3'-activated phosphate-triester of another.

Regardless of the specific use, protecting groups are employed to protect a moiety on a molecule from reacting with another reagent. Protecting groups of the present invention have the following characteristics: they prevent selected reagents from modifying the group to which they are attached; they are stable (that is, they remain attached to the molecule) to the synthesis reaction conditions; they are removable under conditions that do not adversely affect the remaining structure; and once removed, do not react appreciably with the surface or surface-bound oligomer. The selection of a suitable protecting group will depend, of course, on the chemical nature of the monomer unit and oligomer, as well as the specific reagents they are to protect against.

In a preferred embodiment, the protecting groups are photoactivatable. The properties and uses of photoreactive protecting compounds have been reviewed. See McCray et al., Ann. Rev. of Biophys. and Biophys. Chem. (1989) 18:239-270, which is incorporated herein by reference. Preferably, the photosensitive protecting groups will be removable by radiation in the ultraviolet (UV) or visible portion of the electromagnetic spectrum. More preferably, the protecting groups will be removable by radiation in the near UV or visible portion of the spectrum. In some embodiments, however, activation may be performed by other methods such as localized heating, electron beam lithography, laser pumping, oxidation or reduction with microelectrodes, and the like. Sulfonyl compounds are suitable reactive groups for electron beam lithography. Oxidative or reductive removal is accomplished by exposure of the protecting group to an electric current source, preferably using microelectrodes directed to the predefined regions of the surface which are desired for activation. Other methods may be used in light of this disclosure.

Many, although not all, of the photoremovable protecting groups will be aromatic compounds that absorb near-UV and visible radiation. Suitable photoremovable protecting groups are described in, for example, McCray et al., Patchornik, J. Amer. Chem. Soc. (1970) 92:6333, and Amit et al., J. Org. Chem. (1974) 39:192, which are incorporated herein by reference.

A preferred class of photoremovable protecting groups has the general formula:

where R1, R2, R3, and R4 independently are a hydrogen atom, a lower alkyl, aryl, benzyl, halogen, hydroxyl, alkoxyl, thiol, thioether, amino, nitro, carboxyl, formate, formamido or phosphido group, or adjacent substituents (i.e., R1-R2, R2-R3, R3-R4) are substituted oxygen groups that together form a cyclic acetal or ketal; R5 is a hydrogen atom, an alkoxyl, alkyl, hydrogen, halo, aryl, or alkenyl group, and n=0 or 1.
A preferred protecting group, 6-nitroveratryl (NV), which is used for protecting the carboxyl terminus of an amino acid or the hydroxyl group of a nucleotide, for example, is formed when R2 and R3 are each a methoxy group, R1, R4 and R5 are each a hydrogen atom, and n=0:

A preferred protecting group, 6-nitroveratryloxycarbonyl (NVOC), which is used to protect the amino terminus of an amino acid, for example, is formed when R2 and R3 are each a methoxy group, R1, R4 and R5 are each a hydrogen atom, and n=1:



Another most preferred protecting group, methyl-6-nitroveratryloxycarbonyl (MeNVOC), which is used to protect the amino terminus of an amino acid, for example, is formed when R2 and R3 are each a methoxy group, R1 and R4 are each a hydrogen atom, R5 is a methyl group, and n=1:

Another preferred protecting group, 6-nitropiperonyl (NP), which is used for protecting the carboxyl terminus of an amino acid or the hydroxyl group of a nucleotide, for example, is formed when R2 and R3 together form a methylene acetal, R1, R4 and R5 are each a hydrogen atom, and n=0:




Another most preferred protecting group, methyl-6-nitropiperonyl (MeNP), which is used for protecting the carboxyl terminus of an amino acid or the hydroxyl group of a nucleotide, for example, is formed when R2 and R3 together form a methylene acetal, R1 and R4 are each a hydrogen atom, R5 is a methyl group, and n=0:

Another preferred protecting group, 6-nitropiperonyloxycarbonyl (NPOC), which is used to protect the amino terminus of an amino acid, for example, is formed when R2 and R3 together form a methylene acetal, R1, R4 and R5 are each a hydrogen atom, and n=1:

Another most preferred protecting group, methyl-6-nitropiperonyloxycarbonyl (MeNPOC), which is used to protect the amino terminus of an amino acid, for example, is formed when R2 and R3 together form a methylene acetal, R1 and R4 are each a hydrogen atom, R5 is a methyl group, and n=1:

A most preferred protecting group, methyl-6-nitroveratryl (MeNV), which is used for protecting the carboxyl terminus of an amino acid or the hydroxyl group of a nucleotide, for example, is formed when R2 and R3 are each a methoxy group, R1 and R4 are each a hydrogen atom, R5 is a methyl group, and n=0:



A protected amino acid having a photoactivatable oxycarbonyl protecting group, such as NVOC or NPOC or their corresponding methyl derivatives, MeNVOC or MeNPOC, respectively, on the amino terminus is formed by acylating the amine of the amino acid with an activated oxycarbonyl ester of the protecting group. Examples of activated oxycarbonyl esters of NVOC and MeNVOC have the general formula:

where X is halogen, hydroxyl, tosyl, mesyl, trifluoromethyl, diazo, azido, and the like.

Another method for generating protected monomers is to react the benzylic alcohol derivative of the protecting group with an activated ester of the monomer. For example, to protect the carboxyl terminus of an amino acid, an activated ester of the amino acid is reacted with the alcohol derivative of the protecting group, such as 6-nitroveratrol (NVOH). Examples of activated esters suitable for such uses include halo-formate, mixed anhydride, imidazoyl formate, and acyl halide, and also include formation of the activated ester in situ by the use of common reagents such as DCC and the like. See Atherton et al. for other examples of activated esters.




where Y is a halogen atom, a tosyl, mesyl, trifluoromethyl, azido, or diazo group, and the like.

Another class of preferred photochemical protecting 
groups has the formula: 
where X is halogen, mixed anhydride, phenoxy, p-nitrophenoxy, N-hydroxysuccinimide, and the like.

A protected amino acid or nucleotide having a photoactivatable protecting group, such as NV or NP or their corresponding methyl derivatives, MeNV or MeNP, respectively, on the carboxy terminus of the amino acid or 5'-hydroxy terminus of the nucleotide, is formed by acylating the carboxy terminus or 5'-OH with an activated benzyl derivative of the protecting group. Examples of activated benzyl derivatives of MeNV and MeNP have the general formula:

where R1, R2, and R3 independently are a hydrogen atom, a lower alkyl, aryl, benzyl, halogen, hydroxyl, alkoxyl, thiol, thioether, amino, nitro, carboxyl, formate, formamido, sulfonates, sulfido or phosphido group, R4 and R5 independently are a hydrogen atom, an alkoxy, alkyl, halo, aryl, hydrogen, or alkenyl group, and n=0 or 1.

A preferred protecting group,
1-pyrenylmethyloxycarbonyl (PyROC), which is used to
protect the amino terminus of an amino acid, for example, is
formed when R1 through R3 are each a hydrogen atom and
n=1:

Another preferred protecting group, 1-pyrenylmethyl
(PyR), which is used for protecting the carboxy terminus of
an amino acid or the hydroxyl group of a nucleotide, for
example, is formed when R1 through R3 are each a hydrogen
atom and n=0:







A further method for generating protected monomers is to
react the benzylic alcohol derivative of the protecting group
with an activated carbon of the monomer. For example, to
protect the 5'-hydroxyl group of a nucleic acid, a derivative
having a 5'-activated carbon is reacted with the alcohol
derivative of the protecting group, such as methyl-6-
nitropiperonol (MePyROH). Examples of nucleotides having
activating groups attached to the 5'-hydroxyl group have
the general formula:

An amino acid having a pyrenylmethyloxycarbonyl protecting
group on its amino terminus is formed by acylation of the free
amine of the amino acid with an activated oxycarbonyl ester of
the pyrenyl protecting group. Examples of activated oxycarbonyl
esters of PyROC have the general formula:

where X is halogen, or mixed anhydride, p-nitrophenoxy, or
N-hydroxysuccinimide group, and the like.

A protected amino acid or nucleotide having a photoactivatable
protecting group, such as PyR, on the carboxy terminus of the
amino acid or 5'-hydroxy terminus of the nucleic acid,
respectively, is formed by acylating the carboxy terminus or
5'-OH with an activated pyrenylmethyl derivative of the
protecting group. Examples of activated pyrenylmethyl
derivatives of PyR have the general formula:

where X is a halogen atom, a hydroxyl, diazo, or azido
group, and the like.

Another method of generating protected monomers is to
react the pyrenylmethyl alcohol moiety of the protecting
group with an activated ester of the monomer. For example,
an activated ester of an amino acid can be reacted with the
alcohol derivative of the protecting group, such as
pyrenylmethyl alcohol (PyROH), to form the protected derivative
of the carboxy terminus of the amino acid. Examples of
activated esters include halo-formate, mixed anhydride,
imidazoyl formate, acyl halide, and also include formation of
the activated ester in situ and the use of common reagents such
as DCC and the like.

Clearly, many photosensitive protecting groups are suitable
for use in the present invention.

In preferred embodiments, the substrate is irradiated to
remove the photoremovable protecting groups and create
regions having free reactive moieties and side products
resulting from the protecting group. The removal rate of the
protecting groups depends on the wavelength and intensity
of the incident radiation, as well as the physical and chemical
properties of the protecting group itself. Preferred protecting
groups are removed at a faster rate and with a lower
intensity of radiation. For example, at a given set of
conditions, MeNVOC and MeNPOC are photolytically
removed from the N-terminus of a peptide chain faster than
their unsubstituted parent compounds, NVOC and NPOC,
respectively.

Removal of the protecting group is accomplished by
irradiation to liberate the reactive group and degradation
products derived from the protecting group. Not wishing to
be bound by theory, it is believed that irradiation of
NVOC- and MeNVOC-protected oligomers occurs by the
following reaction schemes:

NVOC-AA → 3,4-dimethoxy-6-nitrosobenzaldehyde + CO2 + AA

MeNVOC-AA → 3,4-dimethoxy-6-nitrosoacetophenone + CO2 + AA

where AA represents the N-terminus of the amino acid
oligomer.

Along with the unprotected amino acid, other products are
liberated into solution: carbon dioxide and a 2,3-dimethoxy-
6-nitrosophenylcarbonyl compound, which can react with
nucleophilic portions of the oligomer to form unwanted
secondary reactions. In the case of an NVOC-protected
amino acid, the degradation product is a
nitrosobenzaldehyde, while the degradation product for the
other is a nitrosophenyl ketone. For instance, it is believed
that the product aldehyde from NVOC degradation reacts
with free amines to form a Schiff base (imine) that affects the
remaining polymer synthesis. Preferred photoremovable
protecting groups react slowly or reversibly with the oligomer
on the support.

Again not wishing to be bound by theory, it is believed
that the product ketone from irradiation of a MeNVOC-
protected oligomer reacts at a slower rate with nucleophiles
on the oligomer than the product aldehyde from irradiation
of the same NVOC-protected oligomer. Although not
unambiguously determined, it is believed that this difference in
reaction rate is due to the difference in general reactivity
between aldehydes and ketones towards nucleophiles due to
steric and electronic effects.

The photoremovable protecting groups of the present
invention are readily removed. For example, the photolysis
of N-protected L-phenylalanine in solution and having
different photoremovable protecting groups was analyzed, and
the results are presented in the following table:

TABLE

Photolysis of Protected L-Phe-OH

                        t1/2 in seconds
Solvent                 NBOC    NVOC    MeNVOC    MeNPOC
Dioxane                 1288    110     24        19
5 mM H2SO4/Dioxane      1575    98      33        72

The half life, t1/2, is the time in seconds required to
remove 50% of the starting amount of protecting group.
NBOC is the 6-nitrobenzyloxycarbonyl group, NVOC is the
6-nitroveratryloxycarbonyl group, MeNVOC is the methyl-
6-nitroveratryloxycarbonyl group, and MeNPOC is the
methyl-6-nitropiperonyloxycarbonyl group. The photolysis
was carried out in the indicated solvent with 362/364
nm-wavelength irradiation having an intensity of 10
mW/cm2, and the concentration of each protected
phenylalanine was 0.10 mM.
The table shows that deprotection of NVOC-, MeNVOC-,
and MeNPOC-protected phenylalanine proceeded faster
than the deprotection of NBOC. Furthermore, it shows that
the two derivatives that are substituted on the benzylic
carbon, MeNVOC and MeNPOC, were photolyzed at the
highest rates in both dioxane and acidified dioxane.
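
The half-lives reported above can be converted into exposure times for a target degree of deprotection. The sketch below assumes simple first-order photolysis, which the text does not state; the function names are illustrative only, and the half-lives used are the NVOC, MeNVOC, and MeNPOC values from the dioxane row of the table.

```python
import math

# Half-lives in seconds from the dioxane row of the table.
HALF_LIFE_S = {"NVOC": 110, "MeNVOC": 24, "MeNPOC": 19}


def rate_constant(t_half_s):
    """First-order rate constant k = ln(2) / t_half (assumed kinetics)."""
    return math.log(2) / t_half_s


def fraction_remaining(t_half_s, t_s):
    """Fraction of protecting group still attached after t_s seconds."""
    return 0.5 ** (t_s / t_half_s)


def time_to_remove(t_half_s, fraction_removed):
    """Seconds of irradiation needed to remove the given fraction."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return t_half_s * math.log2(1.0 / (1.0 - fraction_removed))


if __name__ == "__main__":
    for group, t_half in HALF_LIFE_S.items():
        print(group, rate_constant(t_half), time_to_remove(t_half, 0.99))
```

Under this assumption the exposure time for any fixed degree of deprotection scales directly with the half-life, so the ratio of half-lives is also the ratio of required exposure times.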

1. Use of Photoremovable Groups During Solid-Phase 
Synthesis of Peptides 

The formation of peptides on a solid-phase support
requires the stepwise attachment of an amino acid to a
substrate-bound growing chain. In order to prevent
unwanted polymerization of the monomeric amino acid
under the reaction conditions, protection of the amino
terminus of the amino acid is required. After the monomer is
coupled to the end of the peptide, the N-terminal protecting
group is removed, and another amino acid is coupled to the
chain. This cycle of coupling and deprotecting is continued
for each amino acid in the peptide sequence. See Merrifield,
J. Am. Chem. Soc. (1963) 85:2149, and Atherton et al.,
"Solid Phase Peptide Synthesis", 1989, IRL Press, London,
both incorporated herein by reference for all purposes. As
described above, the use of a photoremovable protecting
group allows removal of selected portions of the substrate
surface, via patterned irradiation, during the deprotection
cycle of the solid phase synthesis. This selectively allows
spatial control of the synthesis: the next amino acid is
coupled only to the irradiated areas.

In one embodiment, the photoremovable protecting
groups of the present invention are attached to an activated
ester of an amino acid at the amino terminus:

where R is the side chain of a natural or unnatural amino
acid, X is a photoremovable protecting group, and Y is an
activated carboxylic acid derivative. The photoremovable
protecting group, X, is preferably NVOC, NPOC, PyROC,
MeNVOC, MeNPOC, and the like as discussed above. The
activated ester, Y, is preferably a reactive derivative having
a high coupling efficiency, such as an acyl halide, mixed
anhydride, N-hydroxysuccinimide ester, perfluorophenyl
ester, or urethane protected acid, and the like. Other
activated esters and reaction conditions are well known (See
Atherton et al.).

2. Use of Photoremovable Groups During Solid-Phase
Synthesis of Oligonucleotides

The formation of oligonucleotides on a solid-phase support
requires the stepwise attachment of a nucleotide to a
substrate-bound growing oligomer. In order to prevent
unwanted polymerization of the monomeric nucleotide
under the reaction conditions, protection of the 5'-hydroxyl
group of the nucleotide is required. After the monomer is
coupled to the end of the oligomer, the 5'-hydroxyl protecting
group is removed, and another nucleotide is coupled to
the chain. This cycle of coupling and deprotecting is
continued for each nucleotide in the oligomer sequence. See
Gait, "Oligonucleotide Synthesis: A Practical Approach",
1984, IRL Press, London, incorporated herein by reference
for all purposes. As described above, the use of a
photoremovable protecting group allows removal, via patterned
irradiation, of selected portions of the substrate surface
during the deprotection cycle of the solid phase synthesis.
This selectively allows spatial control of the synthesis: the
next nucleotide is coupled only to the irradiated areas.

Oligonucleotide synthesis generally involves coupling an
activated phosphorous derivative on the 3'-hydroxyl group
of a nucleotide with the 5'-hydroxyl group of an oligomer
bound to a solid support. Two major chemical methods exist
to perform this coupling: the phosphate-triester and
phosphoramidite methods (See Gait). Protecting groups of the
present invention are suitable for use in either method.

In a preferred embodiment a photoremovable protecting
group is attached to an activated nucleotide on the
5'-hydroxyl group:

where B is the base attached to the sugar ring; R is a
hydrogen atom when the sugar is deoxyribose or R is a
hydroxyl group when the sugar is ribose; P represents an
activated phosphorous group; and X is a photoremovable
protecting group. The photoremovable protecting group, X,
is preferably NV, NP, PyR, MeNV, MeNP, and the like as
described above. The activated phosphorous group, P, is
preferably a reactive derivative having a high coupling
efficiency, such as a phosphate-triester, phosphoramidite or
the like. Other activated phosphorous derivatives, as well as
reaction conditions, are well known (See Gait).

3. Amino Acid N-Carboxy Anhydrides Protected With a
Photoremovable Group

During Merrifield peptide synthesis, an activated ester of
one amino acid is coupled with the free amino terminus of
a substrate-bound oligomer. Activated esters of amino acids
suitable for the solid phase synthesis include halo-formate,
mixed anhydride, imidazoyl formate, acyl halide, and also
include formation of the activated ester in situ and the use
of common reagents such as DCC and the like (See Atherton
et al.). A preferred protected and activated amino acid has
the general formula:

where R is the side chain of the amino acid and X is a
photoremovable protecting group. This compound is a
urethane-protected amino acid having a photoremovable
protecting group attached to the amine. A more preferred
activated amino acid is formed when the photoremovable
protecting group has the following formula:

where R1, R2, R3, and R4 independently are a hydrogen
atom, a lower alkyl, aryl, benzyl, halogen, hydroxyl,
alkoxyl, thiol, thioether, amino, nitro, carboxyl, formate,
formamido or phosphido group, or adjacent substituents
(i.e., R1-R2, R2-R3, R3-R4) are substituted oxygen groups
that together form a cyclic acetal or ketal; and R5 is a
hydrogen atom, an alkoxyl, alkyl, hydrogen, halo, aryl, or
alkenyl group.

A preferred activated amino acid is formed when the
photoremovable protecting group is
6-nitroveratryloxycarbonyl. That is, R1 and R4 are each a
hydrogen atom, R2 and R3 are each a methoxy group, and
R5 is a hydrogen atom. Another preferred activated amino
acid is formed when the photoremovable group is
6-nitropiperonyl: R1 and R4 are each a hydrogen atom, R2
and R3 together form a methylene acetal, and R5 is a
hydrogen atom. Other protecting groups are possible.
Another preferred activated ester is formed when the
photoremovable group is methyl-6-nitroveratryl or methyl-6-
nitropiperonyl.

Another preferred activated amino acid is formed when
the photoremovable protecting group has the general
formula:

where R1, R2, and R3 independently are a hydrogen atom, a
lower alkyl, aryl, benzyl, halogen, hydroxyl, alkoxyl, thiol,
thioether, amino, nitro, carboxyl, formate, formamido,
sulfanates, sulfide or phosphido group, and R4 and R5
independently are a hydrogen atom, an alkoxy, alkyl, halo,
aryl, hydrogen, or alkenyl group. The resulting compound is
a urethane-protected amino acid having a
pyrenylmethyloxycarbonyl protecting group attached to the
amine. A more preferred embodiment is formed when R1
through R3 are each a hydrogen atom.

The urethane-protected amino acids having a photoremovable
protecting group of the present invention are prepared by
condensation of an N-protected amino acid with an
acylating agent such as an acyl halide, anhydride,
chloroformate and the like (See Fuller et al., U.S. Pat. No.
4,946,942 and Fuller et al., J. Amer. Chem. Soc. (1990)
112:7414-7416, both herein incorporated by reference for
all purposes).

Urethane-protected amino acids having photoremovable
protecting groups are generally useful as reagents during
solid-phase peptide synthesis, and because of the spatial
selectivity possible with the photoremovable protecting
group, are especially useful for spatially addressable
peptide synthesis. These amino acids are difunctional: the
urethane group first serves to activate the carboxy terminus
for reaction with the amine bound to the surface and, once
the peptide bond is formed, the photoremovable protecting
group protects the newly formed amino terminus from
further reaction. These amino acids are also highly reactive
to nucleophiles, such as deprotected amines on the surface
of the solid support, and due to this high reactivity, the
solid-phase peptide coupling times are significantly reduced,
and yields are typically higher.
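
The deprotect-then-couple cycle described in this section can be pictured with a small simulation. This is only a sketch under simplifying assumptions that are not in the source: the substrate is a one-dimensional row of sites, every coupling is complete, each coupled monomer carries its own photoremovable group, and the monomer letters and function name are hypothetical.

```python
def light_directed_synthesis(n_sites, cycles):
    """Simulate spatially addressed synthesis on a row of sites.

    cycles is a list of (mask, monomer) pairs. mask is the set of site
    indices illuminated in that cycle; only those sites lose their
    protecting group and therefore accept the next monomer.
    Returns the chain at each site, written in order of addition.
    """
    chains = [""] * n_sites
    for mask, monomer in cycles:
        for site in sorted(mask):
            if not 0 <= site < n_sites:
                raise ValueError(f"site {site} is outside the substrate")
            # Illuminated site: protecting group removed, monomer couples.
            # The new monomer is itself protected, so the site is blocked
            # again until a later mask exposes it.
            chains[site] += monomer
    return chains


if __name__ == "__main__":
    cycles = [({0, 1}, "A"), ({0, 2}, "B")]
    print(light_directed_synthesis(4, cycles))
```

With two masks and two monomers, four sites end up with four different products (including the empty site), which is the combinatorial point of patterned irradiation.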

IV. Data Collection 
A. Data Collection System 

Substrates prepared in accordance with the above description
are used in one embodiment to determine which of the
plurality of sequences thereon bind to a receptor of interest.
FIG. 11 illustrates one embodiment of a device used to
detect regions of a substrate which contain fluorescent
markers. This device would be used, for example, to detect
the presence or absence of a labeled receptor such as an
antibody which has bound to a synthesized polymer on a
substrate.

Light is directed at the substrate from a light source 1002
such as a laser light source of the type well known to those
of skill in the art such as a model no. 2025 made by Spectra
Physics. Light from the source is directed at a lens 1004
which is preferably a cylindrical lens of the type well known
to those of skill in the art. The resulting output from the lens
1004 is a linear beam rather than a spot of light, resulting in
the capability to detect data substantially simultaneously
along a linear array of pixels rather than on a pixel-by-pixel
basis. It will be understood that a cylindrical lens is used
herein as an illustration of one technique for generating a
linear beam of light on a surface, but that other techniques
could also be utilized.

The beam from the cylindrical lens is passed through a
dichroic mirror or prism (1006) and directed at the surface
of the suitably prepared substrate 1008. Substrate 1008 is
placed on an x-y translation stage 1009 such as a model no.
PM5004 made by Newport. Light at certain locations on the
substrate will be fluoresced and transmitted along the path
indicated by dashed lines back through the dichroic mirror,
and focused with a suitable lens 1010 such as an f/1.4
camera lens on a linear detector 1012 via a variable f stop
focusing lens 1014. Through use of a linear light beam, it
becomes possible to generate data over a line of pixels (such
as about 1 cm) along the substrate, rather than from
individual points on the substrate. In alternative embodiments,
light is directed at a 2-dimensional area of the substrate and
fluoresced light detected by a 2-dimensional CCD array.
Linear detection is preferred because substantially higher
power densities are obtained.

Detector 1012 detects the amount of light fluoresced from
the substrate as a function of position. According to one
embodiment the detector is a linear CCD array of the type
commonly known to those of skill in the art. The x-y
translation stage, the light source, and the detector 1012 are
all operably connected to a computer 1016 such as an IBM
PC-XT or equivalent for control of the device and data
collection from the CCD array.

In operation, the substrate is appropriately positioned by
the translation stage. The light source is then illuminated,
and intensity data are gathered with the computer via the
detector.

FIG. 12 illustrates the architecture of the data collection
system in greater detail. Operation of the system occurs
under the direction of the photon counting program 1102
(photon), included herewith as Appendix B. The user inputs
the scan dimensions, the number of pixels or data points in
a region, and the scan speed to the counting program. Via a
GPIB bus 1104 the program (in an IBM PC compatible
computer, for example) interfaces with a multichannel scaler
1106 such as a Stanford Research SR 430 and an x-y stage
controller 1108 such as a PM500. The signal from the light
from the fluorescing substrate enters a photon counter 1110,
providing output to the scaler 1106. Data are output from the
scaler indicative of the number of counts in a given region.
After scanning a selected area, the stage controller is
activated with commands for acceleration and velocity, which in
turn drives the scan stage 1112 such as a PM500-A to
another region.

Data are collected in an image data file 1114 and
processed in a scaling program 1116, also included in Appendix
B. A scaled image is output for display on, for example, a
VGA display 1118. The image is scaled based on an input of
the percentage of pixels to clip and the minimum and
maximum pixel levels to be viewed. The system outputs for
use the min and max pixel levels in the raw data.
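
The scaling step can be illustrated with a short sketch. It is not the Appendix B program, which is not reproduced here; it assumes that the clip percentage is applied at each end of the sorted intensities and that the remaining range is mapped linearly onto the display range. The function name and parameters are hypothetical.

```python
def scale_image(pixels, clip_percent=0.0, out_min=0, out_max=255):
    """Linearly map raw counts onto [out_min, out_max] after clipping.

    clip_percent is the percentage of pixels clipped at EACH end of the
    sorted intensities (an assumption; the text does not say how the
    clipped pixels are chosen).
    Returns (scaled_pixels, raw_min, raw_max).
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    ordered = sorted(pixels)
    n = len(ordered)
    # Number of pixels to clip per end, never clipping past the median.
    k = min(int(n * clip_percent / 100.0), (n - 1) // 2)
    low, high = ordered[k], ordered[n - 1 - k]
    span = high - low
    scaled = []
    for p in pixels:
        clipped = min(max(p, low), high)
        if span == 0:
            scaled.append(out_min)
        else:
            scaled.append(out_min + round((clipped - low) * (out_max - out_min) / span))
    # The raw extremes are reported alongside the scaled image.
    return scaled, ordered[0], ordered[-1]


if __name__ == "__main__":
    print(scale_image([0, 10, 20, 30, 1000], clip_percent=20, out_max=100))
```

Clipping keeps a single very bright pixel from compressing the rest of the image into the bottom of the display range, while the raw minimum and maximum remain available to the user.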
B. Data Analysis

The output from the data collection system is an array of
data indicative of fluorescent intensity versus location on the
substrate. The data are typically taken over regions
substantially smaller than the area in which synthesis of a given
polymer has taken place. Merely by way of example, if
polymers were synthesized in squares on the substrate
having dimensions of 500 microns by 500 microns, the data
may be taken over regions having dimensions of 5 microns
by 5 microns. In most preferred embodiments, the regions
over which fluorescence data are taken across the substrate
are less than about 1/2 the area of the regions in which
individual polymers are synthesized, preferably less than 1/10
the area in which a single polymer is synthesized, and most
preferably less than 1/100 the area in which a single polymer
is synthesized. Hence, within any area in which a given
polymer has been synthesized, a large number of
fluorescence data points are collected.

A plot of number of pixels versus intensity for a scan of
a cell when it has been exposed to, for example, a labeled
antibody will typically take the form of a bell curve, but
spurious data are observed, particularly at higher intensities.
Since it is desirable to use an average of fluorescent intensity
over a given synthesis region in determining relative binding
affinity, these spurious data will tend to undesirably skew the
data.

Accordingly, in one embodiment of the invention the data
are corrected for removal of these spurious data points, and
an average of the data points is thereafter utilized in
determining relative binding efficiency.

FIG. 13 illustrates one embodiment of a system for
removal of spurious data from a set of fluorescence data such
as data used in affinity screening studies. A user or the
system inputs data relating to the chip location and cell
corners at step 1302. From this information and the image
file, the system creates a computer representation of a
histogram at step 1304, the histogram (at least in the form of
a computer file) plotting number of data pixels versus
intensity.

For each cell, a main data analysis loop is then performed.
For each cell, at step 1306, the system calculates the total
intensity or number of pixels for the bandwidth centered
around varying intensity levels. For example, as shown in
the plot to the right of step 1306, the system calculates the
number of pixels within the band of width w. The system
then "moves" this bandwidth to a higher center intensity, and
again calculates the number of pixels in the bandwidth. This
process is repeated until the entire range of intensities has
been scanned, and at step 1308 the system determines which
band has the highest total number of pixels. The data within
this bandwidth are used for further analysis. Assuming the
bandwidth is selected to be reasonably small, this procedure
will have the effect of eliminating spurious data located at
the higher intensity levels. The system then repeats at step
1310 if all cells have been evaluated, or repeats for the next
cell.

At step 1312 the system then integrates the data within the
bandwidth for each of the selected cells, sorts the data at step
1314 using the synthesis procedure file, and displays the data
to a user on, for example, a video display or a printer.

V. Representative Applications

A. Oligonucleotide Synthesis

The generality of light directed spatially addressable
parallel chemical synthesis is demonstrated by application to
nucleic acid synthesis.

1. Example

Light activated formation of a thymidine-cytidine dimer
was carried out. A three dimensional representation of a
fluorescence scan showing a checkerboard pattern generated
by the light-directed synthesis of a dinucleotide is shown in
FIG. 14. 5'-nitroveratryl thymidine was attached to a synthesis
substrate through the 3' hydroxyl group. The nitroveratryl
protecting groups were removed by illumination through a
500 μm checkerboard mask. The substrate was then treated
with phosphoramidite activated 2'-deoxycytidine. In order to
follow the reaction fluorometrically, the deoxycytidine had
been modified with an FMOC protected aminohexyl linker
attached to the exocyclic amine (5'-O-dimethoxytrityl-4-N-
(6-N-fluorenylmethylcarbamoyl-hexylcarboxy)-2'-
deoxycytidine). After removal of the FMOC protecting
group with base, the regions which contained the
dinucleotide were fluorescently labelled by treatment of the
substrate with 1 mM FITC in DMF for one hour.

The three-dimensional representation of the fluorescent
intensity data in FIG. 14 clearly reproduces the checkerboard
illumination pattern used during photolysis of the
substrate. This result demonstrates that oligonucleotides as
well as peptides can be synthesized by the light-directed
method.

VI. Conclusion

The inventions herein provide a new approach for the
simultaneous synthesis of a large number of compounds.
The method can be applied whenever one has chemical
building blocks that can be coupled in a solid-phase format,
and when light can be used to generate a reactive group.

The above description is illustrative and not restrictive.
Many variations of the invention will become apparent to
those of skill in the art upon review of this disclosure.
Merely by way of example, while the invention is illustrated
primarily with regard to peptide and nucleotide synthesis,
the invention is not so limited. The scope of the invention
should, therefore, be determined not with reference to the
above description but instead should be determined with
reference to the appended claims along with their full scope
of equivalents.
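
The band-selection procedure described for FIG. 13 in the Data Analysis section (steps 1306 and 1308) can be sketched as follows. This is an illustrative reading of the text, not the program of Appendix B: the band is tracked by its lower edge rather than its center, it is moved in fixed steps, and ties go to the lowest band. All names are hypothetical.

```python
def densest_band(intensities, width, step=1):
    """Find the intensity band of the given width holding the most pixels.

    The band [low, low + width) is moved upward from the minimum
    intensity in increments of step until the whole range is scanned.
    Returns (low, high, selected_pixels) for the best band.
    """
    if not intensities or width <= 0 or step <= 0:
        raise ValueError("need data and positive width and step")
    lowest, highest = min(intensities), max(intensities)
    best_low, best_count = lowest, -1
    low = lowest
    while low <= highest:
        count = sum(1 for v in intensities if low <= v < low + width)
        if count > best_count:
            best_low, best_count = low, count
        low += step
    selected = [v for v in intensities if best_low <= v < best_low + width]
    return best_low, best_low + width, selected


def cell_average(intensities, width, step=1):
    """Average intensity of a cell using only pixels in the densest band."""
    _, _, selected = densest_band(intensities, width, step)
    return sum(selected) / len(selected)


if __name__ == "__main__":
    cell = [98, 100, 101, 102, 103, 105, 400, 950]
    print(cell_average(cell, width=10), sum(cell) / len(cell))
```

In the example, the two bright outliers fall outside the densest band, so the band average stays near the bulk of the data while the plain mean is pulled upward.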
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FIG. 5 
ARRAYS OF NUCLEIC ACID PROBES ON BIOLOGICAL CHIPS

CROSS-REFERENCE TO RELATED APPLICATION

This is a Continuation of application Ser. No. 08/143,312,
filed Oct. 26, 1993, now abandoned, which is a continuation-
in-part of U.S. patent application Ser. No. 082,937, filed
Jun. 25, 1993, now abandoned, incorporated herein by
reference.

Research leading to the invention was funded in part by
NIH grant No. 1R01HG00813-01 and DOE grant No.
DE-FG03-92-ER81275, and the government may have
certain rights to the invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention provides arrays of oligonucleotide
probes immobilized in microfabricated patterns on silica
chips for analyzing molecular interactions of biological
interest. The invention therefore relates to diverse fields
impacted by the nature of molecular interaction, including
chemistry, biology, medicine, and medical diagnostics.

2. Description of Related Art

Oligonucleotide probes have long been used to detect
complementary nucleic acid sequences in a nucleic acid
[illegible] by covalent attachment, to a solid support, and
arrays of oligonucleotide probes immobilized on solid
supports have been used to detect specific nucleic acid
sequences in a target nucleic acid. See, e.g., PCT patent
publication Nos. WO 89/10977 and 89/11548. Others have
proposed the use of large numbers of oligonucleotide probes
to provide the complete nucleic acid sequence of a target
nucleic acid but failed to provide an enabling method for
using arrays of immobilized probes for this purpose. See
U.S. Pat. Nos. 5,202,231 and 5,002,867 and PCT patent
publication No. WO 93/17126.

The development of VLSIPS™ technology has provided
methods for making very large arrays of oligonucleotide
probes in very small arrays. See U.S. Pat. No. 5,143,854 and
PCT patent publication Nos. WO 90/15070 and 92/10092,
each of which is incorporated herein by reference. U.S.
patent application Ser. No. 082,937, filed Jun. 25, 1993,
describes methods for making arrays of oligonucleotide
probes that can be used to provide the complete sequence of
a target nucleic acid and to detect the presence of a nucleic
acid containing a specific nucleotide sequence.

Microfabricated arrays of large numbers of oligonucleotide
probes, called "DNA chips", offer great promise for a
wide variety of applications. New methods and reagents are
required to realize this promise, and the present invention
helps meet that need.

SUMMARY OF THE INVENTION

[illegible] oligonucleotide probes on DNA chips, in which
the probes have specific sequences and locations in the array
to facilitate identification of a specific target nucleic acid.
In another aspect, the invention provides methods for
detecting whether one or more specific sequences of a target
nucleic acid in a sample varies from a previously
characterized sequence or reference sequence. The methods
of the invention can be used to detect variations between a
target and reference sequence, including single or multiple
base substitutions, and deletions and insertions of bases, as
well as detecting the presence, location, and [illegible] of
other more complex variations [illegible].

[illegible] VLSIPS™ technology, but other synthesis methods
and immobilization of pre-synthesized oligonucleotide probes
can be used to make the oligonucleotide probe arrays, called
"DNA chips", of the invention. In general, these arrays
comprise a set of oligonucleotide probes such that, for each
base in a specific reference sequence, the set includes a
probe (called the "wild-type" or "WT" probe) that is exactly
complementary to a section of the reference sequence
including the base of interest and four additional probes
(called "substitution probes"), which are identical to the WT
probe except that the base of interest has been replaced by
one of a predetermined set (typically 4) of nucleotides. In the
preferred embodiment, one of the four substitution probes is
identical to the wild type probe; the other three are
complementary to targets that have a single-base substitution
at this position.

[illegible] adjacent positions in the reference sequence are
also adjacent to one another on the chip. One method
arranges the probes for a single base in a short column
(alternately row) and arranges the columns in the order of
the base position to form horizontal (alternately vertical)
stripes. The wild-type and each of the substitution probes
have specified positions within the column so that all the
probes corresponding to an A substitution, for example, are
in a single row. The stripes may be separated on the chip by
a blank row or column.

The DNA chips of the invention can be made in a wide
number of variations. For some applications, leaving out the
wild-type row, leaving out unimportant bases, pooling bases,
including insertion and deletion probes, varying the length
of probes within a set to make the probes have [illegible]
similar Tm to the target or to avoid secondary structure,
varying the mutation position, using multiple probes for each
mutation, providing replicate probes or arrays, placing blank
"streets" (no probe) between rows, columns, or individual
probes, and using control probes may be appropriate.

The present invention also provides DNA chips for
detecting [illegible] known to be associated with a wide
variety of cancers. Other DNA chips of the invention provide
probe arrays for detecting specific sequences of
mitochondrial DNA, useful for identification and forensic
purposes. The invention also provides DNA chips for
detecting specific sequences of nucleotides or mutations
associated with a drug resistant phenotype in an infectious
organism, such as rifampicin or other drug resistant TB
strains and HIV, in which mutations in an RNA polymerase
gene are known to give rise to drug resistance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows how the tiling method of the invention
defines a set of DNA probes relative to a target nucleic acid.
In the figure, the target is a DNA molecule, the probes are single-stranded nucleic acids 16 nucleotides in length, and only a portion of the probes defined by the method is shown.

FIG. 2 shows an illustrative tiled array of the invention with probes for the detection of point mutations. The base at the position of substitution in each of the wild-type probes is shown in the wild-type lane, and the shading shows the location of the substitution probe having the wild-type sequence. The SEQ ID. NOS. corresponding to the two peptide sequences shown in the top portion of FIG. 2 are 311 and 312, respectively. The SEQ ID. NOS. corresponding to the five peptide sequences listed at the bottom of FIG. 2 are 313, 314, 315, 313, and 316, respectively.

FIG. 3, in panels A, B, and C, shows an image made from the region of a DNA chip containing CFTR exon 10 probes; in panel A, the chip was hybridized to a wild-type target; in panel C, the chip was hybridized to a mutant ΔF508 target; and in panel B, the chip was hybridized to a mixture of the wild-type and mutant targets. The SEQ ID. NOS. corresponding to the four peptide sequences shown in FIG. 3 are 317-320, respectively.

FIG. 4, in sheets 1-3, corresponding to panels A, B, and C of FIG. 3, shows graphs of fluorescence intensity versus tiling position. The labels on the horizontal axis show the bases in the wild-type sequence corresponding to the position of substitution in the respective probes. Plotted are the intensities observed from the features (or synthesis sites) containing wild-type probes, the features containing the substitution probes that bound the most target ("called"), and the feature containing the substitution probes that bound the target with the second highest intensity of all the substitution probes ("2nd Highest"). The SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 1 of FIG. 4 are 321 and 318, respectively; the SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 2 of FIG. 4 are 322 and 318, respectively; and the SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 3 of FIG. 4 are 323 and 318, respectively.

FIG. 5, in panels A, B, and C, shows an image made from a region of a DNA chip containing CFTR exon 10 probes; in panel A, the chip was hybridized to the wt480 target; in panel C, the chip was hybridized to the mu480 target; and in panel B, the chip was hybridized to a mixture of the wild-type and mutant targets. The SEQ ID. NOS. corresponding to the peptide sequences shown in FIG. 5 are 324-327, respectively.

FIG. 6, in sheets 1-3, corresponding to panels A, B, and C of FIG. 5, shows graphs of fluorescence intensity versus tiling position. The labels on the horizontal axis show the bases in the wild-type sequence corresponding to the position of substitution in the respective probes. Plotted are the intensities observed from the features (or synthesis sites) containing wild-type probes, the features containing the substitution probes that bound the most target ("called"), and the feature containing the substitution probes that bound the target with the second highest intensity of all the substitution probes ("2nd Highest"). The SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 1 of FIG. 6 are 328 and 329, respectively; the SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 2 of FIG. 6 are 330 and 329, respectively; and the SEQ ID. NOS. corresponding to the two peptide sequences shown in sheet 3 of FIG. 6 are 331 and 329, respectively.

FIG. 7, in panels A and B, shows an image made from a region of a DNA chip containing CFTR exon 10 probes; in panel A, the chip was hybridized to nucleic acid derived from the genomic DNA of an individual with wild-type ΔF508 sequences; in panel B, the target nucleic acid originated from a heterozygous (with respect to the ΔF508 mutation) individual.

FIG. 8, in sheets 1 and 2, corresponding to panels A and B of FIG. 7, shows graphs of fluorescence intensity versus tiling position. The labels on the horizontal axis show the bases in the wild-type sequence corresponding to the position of substitution in the respective probes. Plotted are the intensities observed from the features (or synthesis sites) containing wild-type probes, the features containing the substitution probes that bound the most target ("called"), and the feature containing the substitution probes that bound the target with the second highest intensity of all the substitution probes ("2nd Highest"). The SEQ ID NOS. corresponding to the two peptide sequences shown in sheet 2 of FIG. 8 are 332 and 318, respectively.

FIG. 9 shows the human mitochondrial genome; "O_H" is the H strand origin of replication, and arrows indicate the cloned unshaded sequence.

FIG. 10 shows the image observed from application of a sample of mitochondrial DNA derived nucleic acid (from the mt4 sample) on a DNA chip.

FIG. 11 is similar to FIG. 10 but shows the image observed from the mt5 sample.

FIG. 12 shows the predicted difference image between the mt4 and mt5 samples on the DNA chip based on mismatches between the two samples and the reference sequence.

FIG. 13 shows the difference image observed for the mt4 and mt5 samples.

FIG. 14, in sheets 1 and 2, shows a plot of normalized intensities across rows 10 and 11 of the array and a tabulation of the mutations detected.

FIG. 15 shows the discrimination between wild-type and mutant hybrids obtained with the chip. A median of the six normalized hybridization scores for each probe was taken; the graph plots the ratio of the median score to the normalized hybridization score versus mean counts. A ratio of 1.6 and mean counts above 50 yield no false positives.

FIG. 16 illustrates how the identity of the base mismatch may influence the ability to discriminate mutant and wild-type sequences more than the position of the mismatch within an oligonucleotide probe. The mismatch position is expressed as % of probe length from the 3'-end. The base change is indicated on the graph.

FIG. 17 provides a 5' to 3' sequence listing of one target corresponding to the probes on the chip. X is a control probe. Positions that differ in the target (i.e., are mismatched with the probe at the designated site) are in bold. The SEQ ID. NO. corresponding to the peptide sequence shown in FIG. 17 is [illegible].

FIG. 18 shows the fluorescence image produced by scanning the chip described in FIG. 17 when hybridized to a sample.

FIG. 19 illustrates the detection of 4 transitions in the target sequence relative to the wild-type probes on the chip in FIG. 18.

FIG. 20 shows the alignment of some of the probes on a p53 DNA chip with a 12-mer model target nucleic acid. The SEQ ID. NOS. corresponding to the fourteen peptide sequences shown in FIG. 20 are 334-347, respectively.

FIG. 21 shows a set of 10-mer probes for a p53 exon 6 DNA chip. The SEQ ID. NOS. corresponding to the thirteen peptide sequences shown in FIG. 21 are 334 and 348-359, respectively.
FIG. 22 shows that very distinct patterns are observed after hybridization of p53 DNA chips with targets having different 1 base substitutions. In the first image in FIG. 22, the 12-mer probes that form perfect matches with the wild-type target are in the first row (top). The 12-mer probes with single base mismatches are located in the second, third, and fourth rows and have much lower signals.

FIG. 23, in graphs 2, 3, and 4, graphically depicts the data in FIG. 22. On each graph, the X ordinate is the position of the probe in its row on the chip, and the Y ordinate is the signal at that probe site after hybridization.

FIG. 24 shows the results of hybridizing mixed target populations of WT and mutant p53 genes to the p53 DNA chip.

FIG. 25, in graphs 1-4, shows (see FIG. 23 as well) the hybridization efficiency of a 10-mer probe array as compared to a 12-mer probe array.

FIG. 26 shows [illegible] DNA chip hybridized to [illegible].

FIG. 27 illustrates how the actual sequence was read from the chip shown in FIG. 26. Gaps in the sequence of letters in the WT rows correspond to control probes or sites. Positions at which bases are miscalled are represented by letters in italic type in cells corresponding to probes in which the WT bases have been substituted by other bases. The SEQ ID. NO. corresponding to the peptide sequence shown in FIG. 27 is 360.

FIG. 28 illustrates the VLSIPS™ technology as applied to the light directed synthesis of oligonucleotides. Light (hν) is shone through a mask (M1) to activate functional groups (-OH) on a surface by removal of a protecting group (X). Nucleoside building blocks protected with photoremovable protecting groups (T-X, C-X) are coupled to the activated areas. By repeating the irradiation and coupling steps, very complex arrays of oligonucleotides can be prepared.

FIG. 29 illustrates how the VLSIPS™ process can be used to prepare "nucleoside combinatorials" or oligonucleotides synthesized by coupling all four nucleosides to form dimers, trimers, etc.

FIG. 30 shows the deprotection, coupling, and oxidation steps of a solid phase DNA synthesis method.

FIG. 31 shows an illustrative synthesis route for the nucleoside building blocks used in the VLSIPS™ method.

FIG. 32 shows a preferred photoremovable protecting group, MeNPOC, and how to prepare the group in active form.

FIG. 33 shows an illustrative detection system for scanning a DNA chip.

DETAILED DESCRIPTION OF THE INVENTION

Using the VLSIPS™ method, one can synthesize arrays of many thousands of oligonucleotide probes on a substrate, such as a glass slide or chip. The method can be used, for instance, to synthesize "combinatorial" arrays consisting of, for example, all possible octanucleotides. Such arrays can be used for primary sequencing-by-hybridization on genomic DNA fragments or other nucleic acids or to detect mutations in a target nucleic acid for which the normal or "wild-type" nucleotide sequence is already known. Using the preferred method of the invention, one employs a strategy called "tiling" to synthesize specific sets of probes at spatially-defined locations on a substrate, creating the novel probe arrays and "DNA chips" of the invention.

To illustrate the tiling method of the invention, consider the problem of detecting mutations at one or more positions in the nucleotide sequence of a target nucleic acid with oligonucleotide probes of defined length. The length (L) of the probe is typically expressed as the number of nucleotides or bases in a single-stranded nucleic acid probe. For purposes of the present invention, lengths ranging from 12 to 18 bases are preferred, although shorter and longer lengths can also be employed. To employ the tiling method, one synthesizes a set of probes defined by the particular nucleotide sequence of interest in the target nucleic acid. For each base in the target DNA segment, one synthesizes a probe complementary to the subsequence of the target nucleic acid beginning at that base and ending L bases to the [illegible] (see FIG. 1).

In a preferred embodiment of the invention, the probes are [illegible] typically by covalent [illegible] of the probe to the substrate) on the substrate or chips in lanes stretching across the chip and separated, and these lanes are arranged in blocks of preferably 5 lanes, although blocks of other sizes will have useful application, as will be apparent from the following illustration. The first of these five lanes, the "wild-type lane", contains probes arranged in order of sequence, and all of the probes are complementary to a specified wild-type nucleic acid sequence. The other four lanes contain probe sets for detecting all possible single-base mutations in the defined sequence; in turn, these probe sets are defined by a position of potential non-complementarity in the probe relative to the target (i.e., a single base mismatch) and the identity of the nucleotide in the probe at that position (i.e., whether the nucleotide is an A, C, G, or T nucleotide). The position of mismatch, also called the position of substitution, is preferably selected to be near the center of the probes, i.e., position 7 of a probe of L=15.

For each probe in the wild-type lane, one synthesizes four probes (one for each of the lanes other than the wild-type lane). Three of these four probes are identical to the corresponding wild-type probe but for the base at the position of substitution, and the remaining probe is identical to the wild-type probe. This set of four substitution probes is preferably placed in a column directly below (or above) the corresponding wild-type probe, thus creating an A-lane, a C-lane, a G-lane, and a T-lane. FIG. 2 shows an illustrative tiled array of the invention with probes for the detection of point mutations. The base at the position of substitution in each of the wild-type probes is shown in the wild-type lane, and the shading shows the location of the substitution probe having the wild-type sequence. Below are the probes that would be placed in the column marked by the arrow if the probe length were 15 and the position of substitution were 7:

3'-CCGACTGCAGTCGTT (SEQ. ID. NO:1)
3'-CCGACTACAGTCGTT (SEQ. ID. NO:2)
3'-CCGACTCCAGTCGTT (SEQ. ID. NO:3)
3'-CCGACTGCAGTCGTT (SEQ. ID. NO:1)
3'-CCGACTTCAGTCGTT (SEQ. ID. NO:4)

Thus, the substitution lanes occupy four of the five lanes separating successive wild-type lanes on the chip; the blocks of five lanes can be separated by a sixth lane for measurement of background signals.

The DNA chips of the invention have a wide variety of applications. In one embodiment, the DNA chip is used to select an optimal probe from an array of probes. In this embodiment, an array of probes of variable length and sequences is synthesized and then hybridized to a target nucleic acid of known sequence. The pattern of hybridization reveals the optimal length and sequence composition of
probes to detect a particular mutation or other specific sequence of nucleotides. In some circumstances, i.e., target nucleic acids with repeated sequences or with high G/C content, very long probes may be required for optimal detection. In one embodiment for detecting specific sequences in a target nucleic acid with a DNA chip, repeat sequences are detected as follows. The chip comprises probes of length sufficient to extend into the repeat region varying distances from each end. The sample, prior to hybridization, is treated with a labeled oligonucleotide that is complementary to a repeat region but shorter than the full length of the repeat. The target nucleic acid is labeled with a second, distinct label. After hybridization the chip is scanned for probes that have bound both the labeled target and the labeled oligonucleotide probe; the presence of such bound probes shows that at least two repeat sequences are present.

A variety of methods can be used to enhance detection of labeled targets bound to a probe on the array. In one embodiment, the protein MutS (from E. coli) or equivalent proteins such as yeast MSH1, MSH2, and MSH3; mouse Rep-3, and Streptococcus Hex-A, is used in conjunction with target hybridization to detect probe-target complexes that contain mismatched base pairs. The protein, labeled directly or indirectly, can be added to the chip during or after hybridization of target nucleic acid, and differentially binds to homo- and heteroduplex nucleic acid. A wide variety of dyes and other labels can be used for similar purposes. For instance, the dye YOYO-1 is known to bind preferentially to nucleic acids containing sequences comprising runs of 3 or more G residues.

The DNA chips produced by the methods of the invention can be used to study and detect mutations in exons of human genes of clinical interest, including point mutations and deletions. In the following sections, the method of the invention is illustrated by the detection of mutations in a variety of clinically and medically significant human nucleic acid sequences. Thus, the invention is illustrated first with respect to the preparation of DNA chips for the detection of mutations associated with cystic fibrosis, then with DNA chips for the detection of human mitochondrial DNA sequences, then with DNA chips for the detection of mutations in the human p53 gene associated with cancer, and finally with respect to the detection of mutations in the HIV RT gene associated with drug resistance.

Detection of Cystic Fibrosis Mutations with DNA Chips

A number of years ago, cystic fibrosis, the most common severe autosomal recessive disorder in humans, was shown to be associated with mutations in a gene thereafter named the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene. The sequences of the exons and parts of the introns in the gene are known, as are the changes corresponding to several hundred known mutations. Several tests have been developed for detecting the most frequent of these mutations. The present invention provides CFTR gene oligonucleotide arrays (DNA chips) that can be used to identify mutations in the CFTR gene rapidly and efficiently.

The methods used to make the high-density DNA chips of the invention allow probes for long stretches of DNA coding regions to be directly "written" onto the chips in the form of sets of overlapping oligonucleotides. These methods have been used to develop a number of useful CFTR gene chips; one illustrative chip bears an array of 1296 probes covering the full length of exon 10 of the CFTR gene arranged in a 36x36 array of 356 μm elements. The probes in the array can have any length, preferably in the range of from 10 to 18 residues, and can be used to detect and sequence any single-base substitution and any deletion within the 192-base exon, including the three-base deletion known as ΔF508. As described in detail below, hybridization of sub-nanomolar concentrations of wild-type and ΔF508 oligonucleotide target nucleic acids labeled with fluorescein to these arrays produces highly specific signals (detected with confocal scanning fluorescence microscopy) that permit discrimination between mutant and wild-type target sequences in both homozygous and heterozygous cases. The method and chips of the invention [illegible].

The most common cystic fibrosis mutation is known as ΔF508 because it is a deletion that results in the removal of amino acid 508 from the CFTR protein. The present invention provides DNA chips for detecting ΔF508; one such chip results from applying the tiling method to exon 10 of the CFTR gene, the exon to which ΔF508 has been mapped. The tiling method involved the synthesis of a set of probes of a selected length in the range of from 10 to 18 bases and complementary to subsequences of the known wild-type CFTR sequence starting at a position a few bases into the intron on the 5'-side of exon 10 and ending a few bases into the intron on the 3'-side. There was a probe for each possible subsequence of the given segment of the gene, and the probes were organized into a "lane" in such a way that traversing the lane from the upper left-hand corner of the chip to the lower right-hand corner corresponded to traversing the gene segment base-by-base from the 5'-end. The lane containing that set of probes is, as noted above, called the "wild-type lane."

Relative to the wild-type lane, a "substitution" lane, called the "A-lane", was synthesized on the chip. The A-lane probes were identical in sequence to an adjacent (immediately below the corresponding) wild-type probe but contained, regardless of the sequence of the wild-type probe, a dA residue at position 7 (counting from the 3'-end). In similar fashion, substitution lanes with replacement bases dC, dG, and dT were placed onto the chip in a "C-lane," a "G-lane," and a "T-lane," respectively. A sixth lane on the chip consisted of probes identical to those in the wild-type lane but for the deletion of the base in position 7 and restoration of the original probe length by addition to the 5'-end of the base complementary to the gene at that position.

The four substitution lanes enable one to deduce the sequence of a target exon 10 nucleic acid from the relative intensities with which the target hybridizes to the probes in the various lanes. The probe organization on the chip can be conveniently columnar, and the set of probes consisting of a wild-type probe and four corresponding substitution probes is referred to as a "column set." One and only one of the four substitution probes in a column set has exactly the same sequence as the wild-type probe in the set. Those of skill in the art will appreciate that, in other embodiments of the invention, one could delete one or more lanes or columns and still benefit from the invention. Various versions of such exon 10 DNA chips were made as described above with probes 15 bases long, as well as chips with probes 10, 14, and 18 bases long. For the results described below, the probes were 15 bases long, and the position of substitution was 7 from the 3'-end.

To demonstrate the ability of the chip to distinguish the ΔF508 mutation from the wild-type, two synthetic target nucleic acids were made. The first, a 39-mer complementary to a subsequence of exon 10 of the CFTR gene having the three bases involved in the ΔF508 mutation near its center, is called the "wild-type" or wt508 target, corresponds to positions 111-149 of the exon, and has the sequence shown below.
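Before the target sequences are listed, the column-set layout and the lane-intensity base call described above can be illustrated with a minimal Python sketch. It assumes probes are handled as strings written 3' to 5' and that the position of substitution is counted from the 3'-end. The function names and the photon counts are hypothetical and are not part of the patent; only the 15-mer and its four substitution probes are taken from the SEQ ID NOS:1-4 example earlier in the description.

```python
# Illustrative sketch only: names and intensity values are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
LANES = "ACGT"


def column_set(wild_type_probe, position=7):
    """Return the wild-type probe and its A-, C-, G- and T-lane probes.

    The probe is written 3'->5'; `position` is 1-based from the 3'-end.
    """
    i = position - 1
    probes = {"WT": wild_type_probe}
    for base in LANES:
        probes[base] = wild_type_probe[:i] + base + wild_type_probe[i + 1:]
    return probes


def call_base(lane_intensities):
    """Infer the target base opposite the position of substitution.

    The brightest substitution lane is treated as the perfect match, so the
    target base is the complement of that lane's base.
    """
    best_lane = max(lane_intensities, key=lane_intensities.get)
    return COMPLEMENT[best_lane]


# Wild-type 15-mer from the SEQ ID NOS:1-4 example, substitution at position 7.
example = column_set("CCGACTGCAGTCGTT")

# Hypothetical photon counts for one column set (made-up numbers).
counts = {"A": 40, "C": 35, "G": 900, "T": 60}
called = call_base(counts)
```

With these made-up counts the G-lane is brightest, so the sketch infers a C in the target at that position; exactly one of the four substitution probes (here the G-lane probe) equals the wild-type probe, as the description of a column set requires.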
5'-CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGA (SEQ. ID NO:5).

The second, a 36-mer probe derived from the wild-type target by removing those same three bases, is called the "mutant" target or mu508 target and has the sequence shown below, first with dashes to indicate the deleted bases, and then without dashes but with one base underlined (to indicate the base detected by the T-lane probe, as discussed below):

5'-CATTAAAGAAAATATCAT---TGGTGTTTCCTATGATGA; (SEQ. ID NO:6)
5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA. (SEQ. ID NO:7)

Both targets were labeled with fluorescein at the 5'-end.

In three separate experiments, the wild-type target, the mutant target, and an equimolar mixture of both targets was exposed (0.1 nM wt508, 0.1 nM mu508, and 0.1 nM wt508 plus 0.1 nM mu508, respectively, in a solution compatible with nucleic acid hybridization) to a CF chip. The hybridization mixture was incubated overnight at room temperature, and then the chip was scanned on a reader (a confocal fluorescence microscope in photon-counting mode; images of the chip were constructed from the photon counts) at several successively higher temperatures while still in contact with the target solution. After each temperature change, the chip was allowed to equilibrate for approximately one-half hour before being scanned. After each set of scans, the chip was exposed to denaturing solvent and conditions to wash, i.e., remove target that had bound, the chip so that the next experiment could be done with a clean chip.

The results of the experiments are shown in FIGS. 3, 4, 5, and [illegible]. FIG. 3, in panels A, B, and C, shows an image made from the region of a DNA chip containing CFTR exon 10 probes; in panel A, the chip was hybridized to a wild-type target; in panel C, the chip was hybridized to a mutant ΔF508 target; and in panel B, the chip was hybridized to a mixture of the wild-type and mutant targets. FIG. 4, in sheets 1-3, corresponding to panels A, B, and C of FIG. 3, shows graphs of fluorescence intensity versus tiling position. The labels on the horizontal axis show the bases in the wild-type sequence corresponding to the position of substitution in the respective probes. Plotted are the intensities observed from the features (or synthesis sites) containing wild-type probes, the features containing the substitution probes that bound the most target ("called"), and the feature containing the substitution probes that bound the target with the second highest intensity of all the substitution probes ("2nd Highest").

These figures show that, for the wild-type target and the equimolar mixture of targets, the substitution probe with a nucleotide sequence identical to the corresponding wild-type probe bound the most target, allowing for an unambiguous assignment of target sequence as shown by letters near the points on the curve. The target wt508 thus hybridized to the probes in the wild-type lane of the chip, although the strength of the hybridization varied from probe-to-probe, probably due to differences in melting temperature. The sequence of most of the target can thus be read directly from the chip, by inference from the pattern of hybridization in the lanes of substitution probes (if the target hybridizes most intensely to the probe in the A-lane, then one infers that the target has a T in the position of substitution, and so on).

For the mutant target, the sequence could similarly be called on the 3'-side of the deletion. However, the intensity of binding declined precipitously as the point of substitution approached the site of the deletion from the 3'-end of the target, so that the binding intensity on the wild-type probe whose point of substitution corresponds to the T at the 3'-end of the deletion was very close to background. Following that pattern, the wild-type probe whose point of substitution corresponds to the middle base (also a T) of the deletion bound still less target. However, the probe in the T-lane of that column set bound the target very well.

Examination of the sequences of the two targets reveals that the deletion places an A at that position when the sequences are aligned at their 3'-ends and that the T-lane probe is complementary to the mutant target with but two mismatches near an end (shown below in lower-case letters, with the position of substitution underlined):

Target: 5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA
Probe:            3'-TagTAGTAACCACAA (SEQ. ID NO:8)

Thus the T-lane probe in that column set calls the correct base from the mutant sequence. Note that, in the graph for the equimolar mixture of the two targets, that T-lane probe binds almost as much target as does the A-lane probe in the same column set, whereas in the other column sets, the probes that do not have wild-type sequence do not bind target at all as well. Thus, that one column set, and in particular the T-lane probe within that set, detects the ΔF508 mutation under conditions that simulate the homozygous case and also conditions that simulate the heterozygous case.

The present invention thus provides individual probes, sets of probes, and arrays of probe sets on chips, in specific patterns, as the probes provide important benefits for detecting the presence of specific exon 10 sequences. The sequences of several important probes of the invention are shown below. In each case, the letter "X" stands for the point of substitution in a given column set, so each of the sequences actually represents four probes, with A, C, G, and T, respectively, taking the place of the "X." Sets of shorter probes derived from the sets shown below by removing up to five bases from the 5'-end of each probe and sets of longer probes made from this set by adding up to three bases from the exon 10 sequence to the 5'-end of each probe, are also useful and provided by the invention.

3'-TTTATAXTAGAAACC (SEQ. ID NO:9)
3'-TTATAGXAGAAACCA (SEQ. ID NO:10)
3'-TATAGTXGAAACCAC (SEQ. ID NO:11)
3'-ATAGTAXAAACCACA (SEQ. ID NO:12)
3'-TAGTAGXAACCACAA (SEQ. ID NO:13)
3'-AGTAGAXACCACAAA (SEQ. ID NO:14)
3'-GTAGAAXCCACAAAG (SEQ. ID NO:15)
3'-TAGAAAXCACAAAGG (SEQ. ID NO:16)
3'-AGAAACXACAAAGGA (SEQ. ID NO:17)

Although in this example the sequence could not be reliably deduced near the ends of the target, where there is not enough overlap between target and probe to allow effective hybridization, and around the center of the target, where hybridization was weak for some other reason, perhaps high AT-content, the results show the method and the probes of the invention can be used to detect the mutation of interest. The mutant target gave a pattern of hybridization that was very similar to that of the wt508 target at the ends, where the two share a common sequence, and very different in the middle, where the deletion is located. As one scans the image from right to left, the intensity of hybridization of the target to the probes in the wild-type lane drops off much more rapidly near the center of the image for mu508 than for wt508; in addition, there is one probe in the T-lane that hybridizes intensely with mu508 and hardly at all with wt508. The results from the equimolar mixture of the two targets, which represents the case one would encounter in testing a heterozygous individual for the mutation, are a
blend of the results for the separate targets, showing the terns. The wild-type sequence could easily be read from the
power of the invention to distinguish a wild-type target chip, but the probe that bound the mu480 target so well when
sequence from one containing the ΔF508 mutation and to only the mu480 target was present also bound it well when
detect a mixture of the two sequences. both the mutant and wild-type targets were present in a
The results above clearly demonstrate how the DNA chips mixture, making the hybridization pattern easily
of the invention can be used to detect a deletion mutation, distinguishable from that of the [illegible] target. The results
ΔF508; another model system was used to show that the [illegible] of the DNA chips of the invention to
chips can also be used to detect a point mutation as well. One [illegible] mutations in [illegible] homo- and heterozygous
of the more frequent mutations in the CFTR gene is G480C, [illegible]
which involves the replacement of the G in position 46 of
exon 10 by a T, resulting in the substitution of a cysteine for To demonstrate application of the DNA chips of
the glycine normally in position #480 of the CFTR protein. the invention, the chips were used to study and detect
The model target sequences included the 21-mer probe mutations in nucleic acids from genomic samples. Genomic
wt480 to represent the wild-type sequence at positions samples from an individual carrying only the wild-type gene
37-55 of exon 10: 5'-CCTTCAGAGGGTAAAATTAAG and an individual heterozygous for ΔF508 were amplified by
(SEQ. ID NO:18) and the 21-mer probe mu480 to represent PCR using exon 10 primers containing the promoter for T7
the mutant sequence: 5'-CCTTCAGAGTGTAAAATTAAG RNA polymerase. Illustrative primers of the invention are
(SEQ. ID NO:19). shown below.



Exon Name Sequence

10 CFi9*T7 TAATACGACIX^CTATAGOGAGatgaocUalaatgalgggni (SEQ. ED. NO:20) 

10 CFU0C-T7. TAATACGACn>CTATAGGGAGtagtgtgugggttcatatgc (SEQ. CO. NOtZl) 

10 CH10C-T3 CTOGGAAnAAOCCTtACIAAAGGagtgtgsagggtlcatatg (SEQ. ED- K022) 
10, 11 CFU0-T7 TAAlACGACrr^CTVOAGGQAGagcatacUaaagtgactctc (SEQ. ID. NO:23) 

11 CFillc-T7 TAATACGACTCAC^ATAGOOAGacatgaatgacatttacagcaa (SEQ. CD. 

11 CFUloT3 CGGAATTAACCCTCACTAAAGOacalgaitgacatttacagcaa (SEQ. CD. KOOS) 
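The primers in the table above are printed with an upper-case promoter tail followed by a lower-case gene-specific portion. A minimal Python sketch of separating the two parts by letter case follows. The function name and the lower-case sequence in the example are hypothetical; the tail is the consensus T7 promoter (TAATACGACTCACTATAGGG) followed by AG, as far as the T7 rows of the table can be read, and is treated here as an assumption rather than a verified transcription.

```python
# Illustrative sketch only: not a transcription of any primer in the table.
from typing import Tuple

# Assumed T7 tail: consensus T7 promoter plus the two bases "AG".
T7_TAIL = "TAATACGACTCACTATAGGGAG"


def split_tailed_primer(primer: str) -> Tuple[str, str]:
    """Split a primer printed as UPPERCASE tail + lowercase gene-specific part."""
    cut = 0
    while cut < len(primer) and primer[cut].isupper():
        cut += 1
    tail, specific = primer[:cut], primer[cut:]
    # Reject primers with no tail, no gene-specific part, or mixed case after the tail.
    if not tail or not specific.islower():
        raise ValueError("expected an upper-case tail followed by lower-case bases")
    return tail, specific


# Hypothetical primer: assumed T7 tail plus a made-up gene-specific sequence.
tail, specific = split_tailed_primer(T7_TAIL + "acgtacgtacgtacgtacgt")
has_t7_promoter = tail.startswith("TAATACGACTCACTATAGGG")
```

Keeping the tail separate from the gene-specific bases makes it easy to check which promoter a primer carries before the amplified product is used as a transcription template.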



In separate experiments, a DNA chip was hybridized to These primers can be used to amplify exon 10 or exon 11
each of the targets wt480 and mu480, respectively, and then sequences; in another embodiment, multiplex PCR is
scanned with a confocal microscope. FIG. 5, in panels A, B, employed, using two or more pairs of primers to amplify
and C, shows an image made from the region of a DNA chip more than one exon at a time.

containing CFTR exon 10 probes; in panel A, the chip was The product of amplification was then used as a template
hybridized to the wt480 target; in panel C, the chip was for the RNA polymerase, with fluoresceinated UTP present
hybridized to the mu480 target; and in panel B, the chip was to label the RNA product. After sufficient RNA was made,
hybridized to a mixture of the wild-type and mutant targets. it was fragmented and applied to an exon 10 DNA chip for
FIG. 6, in sheets 1-3, corresponding to panels A, B, and C 15 minutes, after which the chip was washed with hybridization
of FIG. 5, shows graphs of fluorescence intensity versus buffer and scanned with the fluorescence microscope.
tiling position. The labels on the horizontal axis show the A useful positive control included on many CF exon
bases in the wild-type sequence corresponding to the position 10 chips is the 8-mer 3'-CGCCGCCG-5'. FIG. 7, in panels
of substitution in the respective probes. Plotted are the A and B, shows an image made from a region of a DNA chip
intensities observed from the features (or synthesis sites) containing CFTR exon 10 probes; in panel A, the chip was
containing wild-type probes, the features containing the hybridized to nucleic acid derived from the genomic DNA of
substitution probes that bound the most target ("called"), and an individual with wild-type ΔF508 sequences; in panel B,
the feature containing the substitution probes that bound the the target nucleic acid originated from a heterozygous (with
target with the second highest intensity of all the substitution respect to the ΔF508 mutation) individual. FIG. 8, in sheets
probes ("2nd Highest"). 1 and 2, corresponding to panels A and B of FIG. 7, shows

These figures show that the chip could be used to graphs of fluorescence intensity versus tiling position.
sequence a 16-base stretch from the center of the target These figures show that the sequence of the wild-type
wt480 and that Discrimination against mismatches is quite RNA can be called for most of the bases near the mutation. 
• good throughout the sequenced region. When the DNA chip * In the case of the AF508 heterozygous carrier^ one particular 
was exposed to the target mu480, only one probe in the probe, the same one that distinguished so clearly between 
portion of the chip shown bound the target well: the probe the wild-type and mutant oligonucleotide targets in the 
in the set of probes devoted to identifying the base at 55 model system described above, in the T-lane binds a large 
position 46 in exon 10 and that has an A in the position of amount of RNA, while the same probe binds little RNAfrom 
substitution and so is fully complementary to the central the wild-type individual. These results show that the DNA 
portion of the mutant target All other probes in that region chips of the invention are capable of detecting the AF508 
of the chip have at least one mismatch with the mutant target mutation in a heterozygous carrier, 
and therefore bind much less of it. In spite of that fact, the 60 Thus, the present invention provides methods for synthe- 
sequence of mu480 for several positions to both sides of the sizing large numbers of oligonucleotide probes on a glass 
mutation can be read from the chip, albeit with much- substrate and unique probe sets in a defined array in which 
reduced intensities from those observed with the wild-type the probes are arranged in the array by the "tiling" method 
target. of the invention. The DNA chips produced by the method 

The results also show that, when the two targets were 65 can be used to detect mutations in particular sequences of a 
mixed together and exposed to the chip, the hybridization target nucleic acid, such as genomic DNAor RNA produced 
pattern observed was a combination of the other two pat- from transcription of an amplified genomic DNA. These 



t 
chips can be used to detect both point mutations and small deletions. Moreover, the pattern
of hybridization to the chip allows inferences to be drawn about the sequences of the mutant
DNAs.

For example, in the model system involving the cystic fibrosis point mutation G480C, the
A-lane probe whose position of substitution corresponds to the position of the mutation does
not bind much wild-type target, because in the wild-type sequence, a G occupies that
position. However, it binds mutant target very well, allowing one to infer correctly that the
mutation involves a change of that G to a T. Similarly, in the case of the three-base
deletion in cystic fibrosis known as ΔF508, the T-lane probe that binds mutant target so
intensely is responding to the fact that the deletion has brought a CAT sequence into the
position occupied by a CTT sequence in the wild-type target. The DNA chips of the invention
can be used to detect and sequence not only known mutations in an organism's genome but also
new mutations not previously characterized. The DNA chips and methods of the invention can
also be used to detect specific sequences in other CFTR exons as well as other human genes
for purposes of research and clinical genetic analysis, as demonstrated below.

Detection of Specific Human Mitochondrial DNA Sequences with DNA Chips

As noted above, the present invention provides DNA chips on which a known DNA sequence is
represented as an array of overlapping oligonucleotides on a solid support. This set of
oligonucleotides is used to probe a target nucleic acid comprising the known sequence,
allowing mutations to be detected. As also noted above, there are advantages in some
applications to using a minimal set of oligonucleotides specific to the sequence of interest,
rather than a set of all possible N-mers. Some of these advantages include: (i) each position
in the array is highly informative, whether or not hybridization occurs; (ii) nonspecific
hybridization is minimized; (iii) it is straightforward to correlate hybridization
differences with sequence differences, particularly with reference to the hybridization
pattern of a known standard; and (iv) the ability to address each probe independently during
synthesis, using high resolution photolithography, allows the array to be designed and
optimized for any sequence. For example, the length of any probe can be varied independently
of the others.

The present invention illustrates these advantages by providing DNA chips and analytical
methods for detecting specific sequences of human mitochondrial DNA. In one preferred
embodiment, the invention provides a DNA chip for analyzing sequences contained in a 1.3 kb
fragment of human mitochondrial DNA from the "D-loop" region, the most polymorphic region of
human mitochondrial DNA. One such chip comprises a set of 269 overlapping oligonucleotide
probes of varying length in the range of 9-14 nucleotides with varying overlaps arranged in
~600x600 micron features or synthesis sites in an array 1 cm x 1 cm in size. The probes on
the chip are shown in columnar form below. An illustrative mitochondrial DNA chip of the
invention comprises the following probes (X, Y coordinates are shown, followed by the
sequence; "DL3" represents the 3'-end of the probe, which is covalently attached to the chip
surface.).



0 0 D L3 AGTGGGGTATTT 

1 0 DUGOOTATITAGTr 

2 0 DL3TTAGTITATCCAA 

3 0 DL3ATCCAAACCAGG 

4 0 D L3 ACCAGG ATCGG A 

5 0 DL3CGTGTGTGTGTGG 

6 0 DL30GTGTGTGTGTGGC 

7 . 0 DL3TCGTGTGTGTGTGG 

8 0 DL3GTAGGATGGGTC 

9 0 DL3AGGATGGGTCGT 

10 0 DL3GATGGGTCGTGT 

11 0 DL3TGGCGACGATTG 

12 0 DL3GOGAOGATTGGG 

13 0 DL3TCGGGGGGA 

14 0 DL3GAGGGGGOG 

15 0 DL3GGAGGGGGOGA 

16 0 DL3GAGGGGGOGA 

0 1 DUGGCITGGTIGG 

1 1 DL3GGTTGGTTTGGG 

2 1 DUTCGGGTrTCTAG 

3 1 DL3GTTTCTAGTGGG 

4 1 DL3AGTGGGGGGTGT. 

5 1 DL3GGGGTGTCAAAT 

6 1 DL3GTCAAATACATCG 

7 1 D L3 ACATCG AXTGG AG 

8 1 DL3CGAATGGAGGAG 

9 1 DL3GAGGAGTITCGT 

10 1 DLTTTTCOTTATGTGA 

11 1 DL3 ATGTGAC'l 1 TIAC 

12 1 DL3GACTTTTACAAAT 

13 1 DL3AAATCTGCCCGA 

14 1 DL3AATCTGCCCGAG 

15 1 DUOCCGAGTGTAGT 

16 1 DUAGTGTAGTGGGG 

0 2 DL3GGGAGGGTGAG 

1 2 D L3 GGTG AGGGTATG 

2 2 D L3GGTATGATG ATTAG 

3 2 D L3 G ATTAG AGIAAGT 

4 2 D L3TTAG AGTAAGTTA 



(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ n>. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 



NO:26) 
N027) 
NO-.28) 
NO:29) 
NO-JO) 
N031) 
N032) 
N033) 
N034) 
NO:35) 
NOJ6) 
NOJ7) 
NOJ3) 



(SEQ ID. KO-J9) 
(SEQ ID. NO:40) 
(SEQ ID. NO:41) 
(SEQ ID. NO:42) 
(SEQ ID. NO:43) 
(SEQ ID. NO:44) 
(SEQ ID. NO-.45) 
(SEQ ID. NO;46) 
(SEQ ID. NO:47) 
(SEQID.NO:43) 
(SEQ ID. NO:49) 
(SEQID.NO50) 
(seq nx NOSl) 

(SEQ ID. NO-.52) 
(SEQ ID. N053) 
(SEQ ID. N054) 
(SEQ ID. N055) 
(SEQID.N056) 
(SEQ ID. KOST) 
(SEQ ID. N038) 
CSEQID.N059) 
(SEQID.NO:60) 
(SEQ ID. NCh6J) 
(SEQID.NO:62) 



9 2 D L3GGTAGG ATGGGT 

10 2 DL3GGATGGGTCGTG 

11 2 DL3G O TCGTGTGTGT 

12 2 DUGTGTGTGTGGCG 

13 2 DL3TGTGGOGACGAT 

14 2 DL3GACGATTGGGGT 

15 2 D L3 ATTGGGGTATGG 

16 2 DL3GTATGGGGdTG 

0 3 DL3GGATTGTGGTCG 

1 3 DL3TGGTCGGATTGG 

2 3 DL3GGATTGGTCTAAA 

3 3 D L3TCTAAAGTTTAAA 

4 3 D L3GTTTAAAATAG AA 

5 3 DL3ATAGAAAAACCG 

6 3 DL3AGAAAAACCGC 

7 3 DL3AACCGOCATAC 

8 3 D L3CCATACGTG AAAA 

9 3 D L3 ACGTG AAAATTGT 

10 3 D1JAATTGTCAGTGGG 

11 3 DUTGTCAGTGGGGG 

12 3 • DL3TGGGGTTGA 

.13 3 DL3GGGTTGATTGTGT 

14 3 DL3TTGTGTAATAAAA 

15 3 DUAATAAAAGGGGA 

16 3 D L3TAAAAGGGG AGG 

0 4 DUGllIiliAAAGG 

1 4 DL31 III AAAGGTGG 

2 4 DL3AGGTGGTTTGG 

3 4 DL3TTGGGGGGGAG 

4 4 DUGGAGGGGGCG 

5 4 DL3GGGGCGAAGAC 

6 4 D L3GAAG ACOGG ATG 

7 4 DL3CCGGATGTCGTO 

8 4 DL3GTCGTGAA1 i IU1 

9 4 D L3CGTG AATTTGTGT 

10 4 DL3TTGTGTAOAGACG 

11 4 DL3TAGAGAOGGTIT 

12 4 DUACGGTTTXKjGG 

13 4 DL3TGGGG1 1 1 1 lOT 

14 4 DL3GGGTTTTTGTTT 



(SEQ ID. NO:67) 
(SEQ ID. NO:68) 
(SEQ ID. NO:69) 
(SEQ ID. NO:70) 
(SEQ ID. NO:7l) 
(SEQ ID. NO:72) 
(SEQ ID. NO;73) 
(SEQ ID. KO:74) 
(SEQ ID. NO:75) 
(SEQ ID.KO:76) 
(SEQ ID. NO:77) 
(SEQ ID. NO:78) 
(SEQ ID. NO:79) 
(SEQ ID. NO:80) 
(SEQ ID. NO:81) 
(SEQ ID. NO.82) 
(SEQ ID. NO:83) 
(SEQ ID. NOS4) 
(SEQ ID. NO:85) 
(SEQ ID. NO:86) . 
(SEQ ID. NO:87) 
(SEQ ID. NO:88) 
(SEQ ID. NO:89) 
(SEQ ID. NO^O) 
(SEQ ID. NOdl) 
(SEQ ID. NO:92) 
(SEQ ID.N053) 
(SEQ ID. N034) 
(SEQ ID. N095) 
(SEQ ID. NOS6) 
(SEQ ID. NO:97) 
(SEQ ID. N038) 
(SEQ ID. N059) 
(SEQID.KO:100) 
(SEQ ID. KO:101) 
(SEQ ID. NO:102) 
(SEQ ID. NO:103) 
(SEQID.NChl04) 
(SEQID.NO:105) 
(SEQID.NO:106) 



5,837,832 






-continued 



1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
9 
10 
11 
12 
13 
14 

15 

16 

0 

1 

2 

3 
' 4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

5 

6 

7 
8 
9 



5 2 DUAAGTTXrcTITKK} 

6 2 DL3GTTGGGGGCG 

7 2 DL3GGGGGGGGTA 

8 2 DUGCGGGTAGGAT 

2 5 DL3ACACAATTAATTAA 

3 5 DL3AATTAATTACGAA 

4 5 DL3TACOAACA1XXTG 

5 5 DL3ACGAACATCCTGT 

6 5 D L3T0CTGTATTATTA 

7 5 D L3 GTATTATTATTGTT 

8 5 DL3ATTGTIAAACTTA 

9 5 DL3AAACTTACAGACG 

10 5 DL3ACAGACGTGTCO 

11 5 D L3 GTGTCGGTG AAA 

12 5 DL3GTGAAAGGTGTGT 

13 5 D L3 GGTGTGTCTG TAG 

14 5 DL3TGTGTCTGTAGTA 

15 5 D L3 GTAGTATTGTTTT 

16 5 DL3AGTATTGTTTT7T 

0 6 D L3 CCTCGTGGG ATA 

1 6 DL3TGCGATACAGCG 

2 6 DL3GATACAGCGTCAT 

3 6 DL3GCGTCATAGACAG 

4 6 DL3AGACAGAAACTAA 

5 6 DL3CAGAAACTAAGGA 

6 6 D L3TAAGG ACGG AGT 

7 6 DL3 GAOGG AGTAGG A 

8 6 D L3GTAGGATAATAAA 

9 6 D L3TAATAAATAGCG 

10 6 DL3ATAGCGTAGGAT 

11 6 DL3TAGCGTAGGATG 

12 6 DL3AGGATGCAAGTT 

13 6 DL3ATGCAAGTTATAA 

14 6 DL3GTTATAATGTCCG 

15 6 DL3ATGTCCGC1 1G1 

16 6 DL3TCCG C1 1G1A TG 
0 7 DL3GTGAGTGCOCTC 

7 DL3TGC0CT0GAGAG ■ 
7 D L3 CCTCG AG AGGTA 
7 D L3 AG AGGTACGTAA 
7 D L3ACGTAAACCATA 
7 DL3ACCATAAAAGCAG 
7 D L3 AAAGCAG ACCC 
7 D L3 AG ACCCCCCAT 
7 DL3CCCCCATACGT 
7 DL3CATACGTGCGCT 
7 DL3GTGCGCTATCAG 
7 DL3GCGCTATCAGTA 
7 DL3TCAGTAACGCTC 
7 DL3GTAACGCTCTGC 
10 DUAGTCTATCCCCA 
10 DL3ATC0CCAGGGA 
10 DL3CAGGGAACTGGT 
10 DL3ACTGGTGGTAGG 
10 DL3CTGGTGGTAGGA 
10 DL3GTAGGAGGCACA 
10 DUGGCACAnTAGT 

10 DL3TTTAGTTATAGGG 

11 DL3AGGTTTACGGTG 
11 D L3TACGGTGGGG A 
11 DL3GTGGGGAGTGG 
1 1 DL3GGGAGTGGGTGA 
11 DL3GGGTGATCCTATG 
11 DL3CCTATGGTTGTTT 
11 DL3GGI IG1 1 1GGATG 
11 DL3GTTTGGATGGGT 
11 DUATGGGTGGGAAT 
11 DL3GGGAATTGTCATG 
11 D L3 GTCATGTATCATGT 
11 DUTCATGTATTTCGG 
11 DL3TATTTCGGTAAA 
11 DL3TTCGGTAAATGG 
11 DL3GTAAATGGCATGT 
11 DUGCATGTAATCGTG 

11 DL3GTAATCGTGTAAT 

12 D 13 GGGAGGGGTAC 
12 DUGGGTACGAATGT 
12 DUACGAATGTrcGTT 
12 D L3TGTTCGTTCATGT 
12 DL30G1 1 CATGTCGTT 



(SEQ ID. NO:63) 15 
CSEQ ID. NO:64) 16 
(SEQ ID. NO:65) 0 
(SEQ ID. NCX66) 1. 
(SEQ ID. NOilll) 14 
(SEQ ID. NO:112) *- 15 
CSEQ ID. NO:113) 16 
(SEQ ID. KO:114) 0 
CSEQ ID. NO:ll5) 1 
(SEQID.NO:ll6) 2 
(SEQ ID. NO-.117) 3 
(SEQ ID. NO:118) 4 
(SEQ ID. KO:119) 5 
(SEQID.NO:120) 6 
(SEQ ID. NO:121) 7 
(SEQ ID. NO:122) 8 
(SEQ ID. NCX123) 9 
(SEQ ID. NCX124) 10 
(SEQ ID. NO:125) 11 
(SEQ ID. NO:126) 12 
(SEQ ID. NO:127) 13 
(SEQ ID. NO:128) 14 
(SEQ ID. NO:129) 15 8 
(SEQ ID. NO:130) 16 8 
(SEQ ID. KOJ31) 0 9 
(SEQ ID. NO.132) 1 9 
(SEQ ID. NO:l33) 2 9 
(SEQ ID. NOU34) 3 9 
(SEQ ID. NOJ35) 4 9 
(SEQ ID. NOU36) 5 9 
(SEQ ID. NO:137) 6 9 
(SEQ ID. N0:138) 7 9 
(SEQ ID. NO:l39) 8 9 
(SEQ ID. NO:140) 9 9 
(SEQ ID. NO:141) 10 9 
(SEQ ID. NO:142) 11 9 
(SEQ ID. NO:143) 12 9 
, (SEQ ID. NO:144) 13- 9 
(SEQ ID. NOJ45) 14 9 
(SEQ ID. NO:146) 15 9 
(SEQ ID. NOa47) 16 9 
(SEQ ID. NOU4S) 0 10 
(SEQ ID. NCX149) 1 10 
(SEQ ID. NO150) 2 10 
(SEQ ID. NO-.151) 3 10 
(SEQ ID. NOJ52) 4 10 
(SEQ ID. NO:153) 5 10 
(SEQ ID. NO:154) 6 10 
(SEQ ID. NO:155) 7 10 
(SEQ ID. NO.156) 8 10 
(SEQ ID. NO203) 11 13 
(SEQ ID. NO204) 12 13 
(SEQ ID. NO205) 13 13 
(SEQ ID. NO206) 14 13 
(SEQ ID. NO:207) 15 13 
. (SEQ ID. NO208) 16 13 
(SEQ ID. NO-.209) 5 14 
(SEQ ID. NO210) 6 14 
(SEQ ID. K0211) 7 14 
(SEQ ID. K&212) 8 14 
(SEQ ID. NO-.213) 9 14 
(SEQ ID. N0214) 10 14 
(SEQ ID. N0215) 11 14 
(SEQ ID. K0216) 12 14 
(SEQ ID. N0217) 13 14 
(SEQ ID. N0218) 14 14 
(SEQHXN0219) 15 14 
(SEQ ID. NO220) 16 14 
(SEQID.NO:221) 5 15 
(SEQ ID. N0222) 6 15 
(SEQ ID. N0223) 7 15 
(SEQ ID. N0224) 8 15 
(SEQ ID. K&225) 9 15 
(SEQ ID. N0226) 10 15 
(SEQ ID. K0227) 11 15 
(SEO ID. N0228) 12 15 
(SEQ ID. NO-.229) 13 15 
(SEQ ID. N0230) 14 15 
(SEQ ID. K0231) 15 15 
(SEQ ID. NCX232) 16 15 



4 
4 

5 
5 
7 
7 
7 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 
8 



D1JTTGTTTCTTGGG 
DUTCTTGGGATTGTG 
DLJTCTATGAATGATTT 
DL3TGATTTCACACAA 
DL3CTCTCCGACCTC 
DL3GACCTCGGCCT 
DUTCGGCC I'CGTG 
DL3GATGAAGTCCCAG 
DL3AGTCCCAGTATTT 
DUGTATTTCGGATTT 
DIJTCGGATTTATCG 
DL3GATTTATCGGGT 
DUATCGGGTGTGCA 
DL3TGTGCAAGGGGA 
DL3CAAGGGGAATTT 
DUGAATTTATTCTGTA 
DL3TCTGTAGTGCTAC 
DL3GTAGTGCTACCT 
DL3GCTACGTAGTAG 
DL3CTAGTAGTOCAGA 
D L3TCCAG ATA9TGGG 
DL3AGATAGTGGGATA 
DL3GCGATAATTGGT 
DL3TAATTGGTGAGTG 
DL3TATAGGGCGTGT 
DL3GGGCGIGMC1CA 
DL3GTG ITCrCACGAT 
DL3TCACGATGAGAGG 
DUATGAGAGGAGCG 
DL3AGGAGCGAGGC 
DL3CGAGGCCCGG 
DL3GCCCGGGTATT 
D UCGGGTATTGTG A 
DL5GTGAACCCCCAT 
DL3CCCCATCGATTT 
DIJATCGATTTCACTT 
D UTTTCACTTG ACAT 
D L3TTGACATAG AGCT 
bUTAGAGCTCTAGAC 
DUGTAGAOCAAGGA 
DUACCAAGGATGAAG 
DL3CGTGTAATGTCAG 
DUrGTCAGTTTAGGG 
DL3TCAGTTTAGGGA 
DL3TAGGGAAGAGCA 
DL3AAGAGCAGGGGT 
DUCAGGGCTXCCTA 
DL3GGTACCIACTGG 
DL3TACTGGGGGGA 
DL3GGGGGAGTCIAT 
DL3CATGTA1 i 1 1 1 GG 
DL3T1 1 1GGGTTAGG 
DL3GGGTTAGGATGT 
DUGGAT GTAOl liiG 
D L3TGTAG111 IGGG 
DL3TTTGGGGGAGG 
DLKKKjTTCATAACTG 
D U ATAACTGAGTGGG 
DL3AACTGAGTGGGT 
DL3GTCGGTAGTTGT 
DL3GTAG1 IGITGGC 
; DL3GTTGGCGATACA 
D L3CG ATACATAAAA G 
DL3TAAAAGCATGTAA 
DL3GCATGTAATGACG 
DL3ATGACGGTCGGT 
DL3GTCGGTGGTACT 
DL3GGTACTXATAACA 
DL3TCGATTCTAAOAT 
DUTAAGATTAAATTT 
D L3 AAATTTG AATAAG 
DL3AATAAGAGACAAG 
DUAAGAGACAAGAAA 
DL3AAGAAAGTACCC 
DUAAAGTACXXCTT 
DUCCXXTTOCTCTA 
DL3CTTCCTCIAAAC 
DL3CIAAACCCATGG 
DUAACCCATGGTGG 
DL3TGGTGGGTTCAT 



(SEQ ID. NO:107) 
(SEQ ID. NO:103) 
(SEQ ID. KO:109) 
(SEQ ID. NO:110) 
(SEQ [D. NO:157) 
(SEQ ID. NO:15S) 
(SEQ ID. NO:159) 
(SEQ ID. NO: 160) 
(SEQID.KO:16l) 
(SEQID.NO:162) 
(SEQ ID. NO:l 63) 
(SEQID.NO:164) 
(SEQ ID. NO-.165) 
(SEQED.NO:166) 
(SEQ ID. NO:167) 
(SEQ ID. NO:168) 
(SEQ ID. NCX169) 
(SEQ ID. NO:170) 
(SEQ ID. KO:17l) 
(SEQ ID. NO:172) 
(SEQ ID. NOU73) 
(SEQ ID. KO:174) 
(SEQ ID. KO:175) 
(SEQ ID. NO:176) 
(SEQ ID. NO:177) 
(SEQ ID. KO:178) 
(SEQ ID. NO;179) 
(SEQ ID. NO:lS0) 
(SEQ ID. NO:18l) 
(SEQ ID. NO:182) 
(SEQ ID. NO:183) 
(SEQ ED. NO:184) 
(SEQ ID. NO:I85) 
(SEQ ID. NO:18*) 
(SEQ ID. NO:187) 
(SEQ ID. KChlSS) 
(SEQ ID. NO:189) 
(SEQ ID. NOa90) • 
(SEQ ID. Nai9l) * 
(SEQ ID. NO:192) 
(SEQ ID. NO:193) 
(SEQ ED. KO:194) 
(SEQ ID. NO;195) 
(SEQ ID. NO:196) 
(SEQ ID. NO:i97) 
(SEQ ID. NO:198) 
(SEQ ID. NO:199) 
(SEQ ID. KO^OO) 
(SEQ ID. KO:201) 
(SEQ ID. KO202) 
(SEQ ID. NO:246) 
(SEQ ID. NCh247) 
(SEQ ID. N0^48) 
(SEQ ID. N0249) 
(SEQ ID. NO250) 
(SEQ ID. NO-^51) 
(SEQ ID. K0252) 
(SEQ ID. NO-^53) 
(SEQ ID. K0254) 
(SEQ ID.N0^55) 
(SEQ HX NOt25Q 
(SEQID.KO-^57) 
(SEQ ID. N025S) 
(SEQ ID. N0^59) 
(SEQ ID. NO560) 
(SEQ HX N026I) 
(SEQ ID. N0262) 
(SEQ ID. K0263) 
(SEQID.K0264) 
(SEQID.N0^65) 
(SEQ ID. N0266) 
(SEQ ID. KO-J57) 
(SEQ ID. N0^6S) 
(SEQ ID. N0269) 
(SEQ ID. KO270) 
(SEQID.KOa71) 
(SEQ ID. K0272) 
(SEQ ED. NO-.273) 
(SEQ ID. NCh274) 
(SEQID.N0575) 
-continued



10 12 DIJGTCGTTAGTTCG 

11 12 D L3TAGTTGOO AGTT 

12 12 DL3GGAGTTOATAGTG 

13 12 DL3ATAGTGTGTAGTT 

14 12 DL3GTFIAGTTG ACGT 

15 12 D L3TGACGTTG AGGT 

16 12 DL3CGTTGAGGTTTA 

5 13 DUTATAACATGCCAT 

6 13 DL3AACATGCCATGGT 

7 13 DL30CATGGTA3TAT 

8 13 DL3ATTTATGAACTGG 

9 13 DUAACTGCTGGACAT 

10 13 DL3TGOACATCATGTA 



(SEQ ID. KOJ33) 5 16 

(SEQ ID. N0234) 6 16 

(SEQ ID. K0235) 7 16 

(SEQ ID. N0236) 8 16 

(SEQ ID: N0237) 9 16 

(SEQ ID. N0238) 10 16 

CSEQ ID. KOr239) • 11 16 

(SEQ ID. NO240) 12 16 

(SEQ ID. N0241) 13 16 

(SEQ ID. K0242) 14 16 

(SEQ ID. N0243) 15 16 

CSEQ ID. NO:244) 16 16 
(SEQ ID. N0245) 



DL3TTGGAAAAAGGT (SEQ ID. NO:27d) 

DUAAAAGGTTCCTC (SEQ ID. N0277) 

D L3GGTTCCTGTn"A (SEQ ID. NO:278) 

DL3CCnmTAGTXTC (SEQ ID. N0279) 

DL3TTA01UU 1 11 1 . (SEQ ED. N02SO) 

DL3CnTTTCAGAAAT (SEQ ID. NO:281) 

DL3AGAAATTGAGGTG (SEQ ID. N02S2) 

D L3 AAATTG AGGTGGT (SEQ ID, NO:2SJ) 

DL3GGTGGTAATCGT (SEQ ID. NO:284) 

DL3TAATCGTGGGTT (SEQ ID. NO:28i) 

DL3CTGGGTTTCGAT (SEQ ID. N0286) 

DL3GGTTTCGATTCT (SEQ ID. NO:287) 
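
The probe listing above follows the layout given in the text: X and Y coordinates, then "DL3" marking the surface-attached 3'-end, then the probe sequence. A minimal sketch of reading one cleanly transcribed row of that form is given below. The names `Probe` and `parse_probe_row` are hypothetical helpers introduced only for illustration, and the 9-14 nucleotide length check simply applies the range stated for this chip.

```python
import re
from typing import NamedTuple


class Probe(NamedTuple):
    """One feature of the illustrative mitochondrial chip (hypothetical record)."""
    x: int
    y: int
    sequence: str  # bases as listed after the "DL3" marker


# Row layout assumed from the listing: "<X> <Y> DL3<sequence>".
ROW_PATTERN = re.compile(r"^\s*(\d+)\s+(\d+)\s+DL3([ACGT]+)\s*$")


def parse_probe_row(row: str) -> Probe:
    """Parse one listing row; reject rows that do not fit the stated layout."""
    match = ROW_PATTERN.match(row)
    if match is None:
        raise ValueError(f"unrecognized probe row: {row!r}")
    x, y, sequence = int(match.group(1)), int(match.group(2)), match.group(3)
    # The text gives probe lengths in the range of 9-14 nucleotides.
    if not 9 <= len(sequence) <= 14:
        raise ValueError(f"probe length {len(sequence)} outside 9-14: {row!r}")
    return Probe(x, y, sequence)


# Three rows taken from the listing above.
rows = ["3 0 DL3ATCCAAACCAGG", "9 0 DL3AGGATGGGTCGT", "13 0 DL3TCGGGGGGA"]
probes = [parse_probe_row(r) for r in rows]
for probe in probes:
    print(probe.x, probe.y, probe.sequence, len(probe.sequence))
```

Rows whose characters were damaged in transcription fail the pattern and raise `ValueError`, which is one way to flag entries that need checking against the sequence listing.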



No probes were present in positions X, Y=0, 12 to X, Y=4, 12; X, Y=0, 13 to X, Y=4, 13; X,
Y=0, 14 to X, Y=4, 14; X, Y=0, 15 to X, Y=4, 15; X, Y=0, 16 to X, Y=4, 16. The length of each
of the probes on the chip was variable to minimize differences in melting temperature and
potential for cross-hybridization. Each position in the sequence is represented by at least
one probe and most positions are represented by 2 or more probes. As noted above, the amount
of overlap between the oligonucleotides varies from probe to probe. FIG. 9 shows the human
mitochondrial genome; "O_H" is the H strand origin of replication, and arrows indicate the
cloned unshaded sequence.

DNA was prepared from hair roots of six human donors (mt1 to mt6) and then amplified by PCR
and cloned into M13; the resulting clones were sequenced using chain terminators to verify
that the desired specific sequences were present. DNA from the sequenced M13 clones was
amplified by PCR, transcribed in vitro, and labeled with fluorescein-UTP using T3 RNA
polymerase. The 1.3 kb RNA transcripts were fragmented and hybridized to the chip. The
results showed that each different individual had DNA that produced a unique hybridization
fingerprint on the chip and that the differences in the observed patterns could be correlated
with differences in the cloned genomic DNA sequence. The results also demonstrated that very
long sequences of a target nucleic acid can be represented comprehensively as a specific set
of overlapping oligonucleotides and that arrays of such probe sets can be usefully applied to
genetic analysis.

The sample nucleic acid was hybridized to the chip in a solution composed of 6xSSPE, 0.1%
Triton X-100 for 60 minutes at 15° C. The chip was then scanned by confocal scanning
fluorescence microscopy. The individual features on the chip were 588x588 microns, but the
lower left 5x5 square features in the array did not contain probes. To quantitate the data,
pixel counts were measured within each synthesis site. Pixels represent 50x50 microns. The
fluorescence intensity for each feature was scaled to a mean determined from 27 bright
features. After scanning, the chip was stripped and rehybridized; all six samples were
hybridized to the same chip. FIG. 10 shows the image observed from the mt4 sample on the DNA
chip. FIG. 11 shows the image observed from the mt5 sample on the DNA chip. FIG. 12 shows the
predicted difference image between the mt4 and mt5 samples on the DNA chip based on
mismatches between the two samples and the reference sequence (see Anderson et al., 1981,
Nature 290: 457-465, incorporated herein by reference). FIG. 13 shows the actual difference
image observed.

The results show that, in almost all cases, mismatched probe/target hybrids resulted in lower
fluorescence intensity than perfectly matched hybrids. Nonetheless, some probes detected
mutations (or specific sequences) better than others, and in several cases, the differences
were within noise levels. Improvements can be realized by increasing the amount of overlap
between probes and hence overall probe density and, for duplex DNA targets, using a second
set of probes, either on the same or a separate chip, corresponding to the second strand of
the target. FIG. 14, in sheets 1 and 2, shows a plot of normalized intensities across rows 10
and 11 of the array and a tabulation of the mutations detected.

FIG. 15 shows the discrimination between wild-type and mutant hybrids obtained with this
chip. The median of the six normalized hybridization scores for each probe was taken. The
graph plots the ratio of the median score to the normalized hybridization score versus mean
counts. On this graph, a ratio of 1.6 and mean counts above 50 yield no false positives, and
while it is clear that detection of some mutants can be improved, excellent discrimination is
achieved, considering the small size of the array. FIG. 16 illustrates how the identity of
the base mismatch may influence the ability to discriminate mutant and wild-type sequences
more than the position of the mismatch within an oligonucleotide probe. The mismatch position
is expressed as % of probe length from the 3'-end. The base change is indicated on the graph.
These results show that the DNA chip increases the capacity of the standard reverse dot blot
format by orders of magnitude, extending the power of that approach many fold, and that the
methods of the invention are more efficient and easier to automate than gel-based methods of
nucleic acid sequence and mutation analysis.

These advantages become more apparent as chips with more and more probes are employed. To
illustrate, the present invention provides a DNA chip for analyzing human mitochondrial DNA
(mtDNA) that "tiles" through 648 nucleotides of human H strand mtDNA from positions 16280 to
356. The probes in the array are 15 nucleotides in length, and each position in the target
sequence is represented by a set of 4 probes (A, C, G, T substitutions), which differed from
one another at position 7 from the 3'-end. The array consists of 13 blocks of 4x50 probes:
each block scans through 50 nucleotides of contiguous mtDNA sequence. The blocks are
separated by blank rows. The 4 corner columns contain control probes; there are a total of
2600 probes in a 1.28 cm x 1.28 cm square area (feature), and each area is 256x197 microns.

Labeled RNA target was prepared by PCR amplification of a 1.3 kb region of human mtDNA
spanning positions 15935 to 667, cloning into M13 (sequence verification was performed), and
reamplification of the cloned sequences using primers tagged with T3 and T7 RNA polymerase
promoter sequences and in vitro transcription to produce fluorescein-UTP labeled RNA. The RNA
was fragmented and hybridized to the oligonucleotide array in a solution composed of 6xSSPE,
0.1% Triton X-100 for 60 minutes at 18° C. Unhybridized material was washed away with buffer,
and the chip was scanned at 25 micron pixel resolution.
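
The tiled array described above assigns a set of 4 substitution probes (A, C, G, T) to each target position, so one simple way to read such a chip is to call, at each position, the base whose probe gives the strongest signal. The sketch below illustrates that idea only. The function names `call_bases` and `probe_count`, the per-position dictionary of intensities, and the `min_ratio` cutoff are assumptions made for illustration and are not taken from the text, which does not specify a calling algorithm.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"


def call_bases(
    intensities: List[Dict[str, float]], min_ratio: float = 1.2
) -> List[Tuple[str, float]]:
    """Call one base per target position from its 4 substitution probes.

    `intensities[i]` maps each base to the signal of the probe carrying that
    base at the substitution position. `min_ratio` is a hypothetical cutoff on
    (highest / second highest); positions below it are reported as "N".
    """
    calls = []
    for position in intensities:
        ranked = sorted(BASES, key=lambda base: position[base], reverse=True)
        best, second = position[ranked[0]], position[ranked[1]]
        ratio = best / second if second > 0 else float("inf")
        calls.append((ranked[0] if ratio >= min_ratio else "N", ratio))
    return calls


def probe_count(blocks: int, rows_per_block: int, columns: int) -> int:
    """Number of probes in a block-tiled array."""
    return blocks * rows_per_block * columns


# Made-up intensities for two positions: one clear call, one ambiguous.
example = [
    {"A": 10.0, "C": 200.0, "G": 12.0, "T": 8.0},
    {"A": 50.0, "C": 48.0, "G": 5.0, "T": 4.0},
]
calls = call_bases(example)
total = probe_count(blocks=13, rows_per_block=4, columns=50)
print([base for base, _ in calls], total)
```

The `probe_count` call reproduces the arithmetic in the text: 13 blocks of 4x50 probes give 2600 probes, and 1.28 cm divided across 50 columns gives the stated 256 micron feature width.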
FIG. 17 provides a 5' to 3' sequence listing of one target corresponding to the probes on the
chip. X is a control probe. Positions that differ in the target (i.e., are mismatched with
the probe at the designated site) are in bold. FIG. 18 shows the fluorescence image produced
by scanning the chip when hybridized to this sample. About 95% of the sequence could be read
correctly from only one strand of the original duplex target nucleic acid. Although some
probes did not provide excellent discrimination and some probes did not appear to hybridize
to the target efficiently, excellent results were achieved. The target sequence differed from
the probe set at six positions: 4 transitions and 2 insertions. All 4 transitions were
detected, and specific probes could readily be incorporated into the array to detect
insertions or deletions. FIG. 19 illustrates the detection of 4 transitions in the target
sequence relative to the wild-type probes on the chip.

These results illustrate that longer sequences can be read using the DNA chips and methods of
the invention, as compared to conventional sequencing methods, where reading length is
limited by the resolution of gel electrophoresis. Similar results were observed when genomic
DNA samples were prepared from human hair roots. Hybridization and signal detection require
less than an hour and can be readily shortened by appropriate choice of buffers,
temperatures, probes, and reagents. In principle, longer sequence reads can be obtained than
by conventional sequencing, where reading length is limited by the resolution of gel
electrophoresis.

P53 Sequencing and Diagnostic DNA Chips

P53 is a tumor suppressor gene that has been found to be mutated in most forms of cancer (see
Levine et al., 1991, Nature 351: 453-456, and Hollstein et al., 1991, Science 253: 49-53,
each of which is incorporated herein by reference). In addition, there is a hereditary
syndrome, Li-Fraumeni, in which individuals inherit mutant alleles of p53 and tend to have
cancer at relatively young ages (Frebourg et al., 1992, PNAS 89: 6413-6417, incorporated
herein by reference). During the development of a cancer, p53 is inactivated. The course of
p53 inactivation generally involves a mutation in one copy of p53 and is often followed by
deletion of the other copy. After p53 is inactivated, chromosomal abnormalities begin to
appear in tumors. In the best understood form of cancer, colorectal cancer, well over 50%,
perhaps 80%, of all patients with tumors have p53 mutations. In addition, p53 mutations have
been found in a high proportion of lung, breast, and other tumors (Rodrigues et al., 1990,
PNAS 87: 7555-7559, incorporated herein by reference). According to data presented by David
Sidransky (1992 San Diego Conference), over 400 mutations in p53 are known.

The p53 gene spans 20 kbp in humans and has 11 exons, 10 of which are protein coding (see
Tominaga et al., 1992, Critical Reviews in Oncogenesis 3: 257-282, incorporated herein by
reference). The gene produces a 53 kilodalton phosphoprotein that regulates DNA replication.
The protein acts to halt replication at the G1/S boundary in the cell cycle and is believed
to act as a "molecular policeman," shutting down replication when the DNA is damaged or
blocking the reproduction of DNA viruses (see Lane, 1992, Nature 358: 15-16, incorporated
herein by reference). There is substantial interest in the cancer research community in
analyzing p53 mutations. The NCI is currently funding contracts to characterize the p53
mutation spectra caused by various carcinogens. In addition, there are research projects
which involve sequencing p53 from spontaneously arising tumors. A major resource in these
studies is the huge supply of biopsy material stored in paraffin blocks. Also, there are
projects which are aimed at analyzing the relationship between the particular mutation in p53
and the functioning of the resulting protein. Furthermore, there are projects looking at the
germline inheritance of p53 mutations and the development of cancer. The present invention
provides useful DNA chips and methods for such studies.

In addition, the present invention also provides a diagnostic test kit and method and p53
probes immobilized on a DNA chip in an organized array. Currently available diagnostic tests
for cancer typically have a sensitivity of about 50%. The present invention provides
significant advantages over such tests, and in one embodiment provides a method for detecting
cancer-causing mutations in p53 that involves the steps of (1) obtaining a biopsy, which is
optionally fractionated by cryostat sectioning to enrich tumor cells to about 80% of the
total cell population. The DNA or RNA is then extracted, amplified, and analyzed with a DNA
chip for the presence of p53 mutations correlated with malignancy.

To illustrate the value of the DNA chips of the present invention in such a method, a DNA
chip was synthesized by the VLSIPS™ method to provide an array of overlapping probes which
represent or tile across a 60 base region of exon 6 of the p53 gene. To demonstrate the
ability to detect substitution mutations in the target, twelve different single substitution
mutations (wild type and three different substitutions at each of three positions) were
represented on the chip along with the wild type. Each of these mutations was represented by
a series of twelve 12-mer oligonucleotide probes, which were complementary to the wild type
target except at the one substituted base. Each of the twelve probes was complementary to a
different region of the target and contained the mutated base at a different position (e.g.,
if the substitution was at base 32, the set of probes would be complementary, with the
exception of base 32, to regions of the target 21-32, 22-33, and 32-43). This enabled
investigation of the effect of the substitution position within the probe. The alignment of
some of the probes with a 12-mer model target nucleic acid is shown in FIG. 20.

To demonstrate the effect of probe length, an additional series of ten 10-mer probes was
included for each mutation (see FIG. 21). In the vicinity of the substituted positions, the
wild-type sequence was represented by every possible overlapping 12-mer and 10-mer probe. To
simplify comparisons, the probes corresponding to each varied position were arranged on the
chip in the rectangular regions with the following structure: each row of cells represents
one substitution, with the top row representing the wild type. Each column contains probes
complementary to the same region of the target, with probes complementary to the 3'-end of
the target on the left and probes complementary to the 5'-end of the target on the right. The
difference between two adjacent columns is a single base shift in the positioning of the
probes. Whenever possible, the series of 10-mer probes were placed in four rows immediately
underneath and aligned with the 4 rows of 12-mer probes for the same mutation.

To provide model targets, 5' fluoresceinated 12-mers containing all possible substitutions in
the first position of codon 192 were synthesized (see the starred position in the target in
FIG. 20). Solutions containing 10 nM target DNA in 6xSSPE, 0.25% Triton X-100 were hybridized
to the chip at room temperature for several hours. While target nucleic acid was hybridized
to the chip, the fluorophores on the chip were excited by light from an argon laser, and the
chip was scanned with an autofocusing confocal microscope. The emitted signals were processed
by a PC to produce an image using image analysis software. By 1 to 3 hours, the signal had
reached a plateau; to remove the hybridized target and
allow hybridization to another target, the chip was stripped
with 60% formamide, 2xSSPE at 17° C. for 5 minutes. The
washing buffer and temperature can vary, but the buffer
typically contains 2-to-3xSSPE, 10-to-60% formamide (one
can use multiple washes, increasing the formamide concentration
by 10% each wash, and scanning between washes to
determine when the wash is complete), and optionally a
small percentage of Triton X-100, and the temperature is
typically in the range of 15° to 18° C.

Very distinct patterns were observed after hybridization
with targets with 1 base substitutions and visualization with
a confocal microscope and software analysis, as shown in
FIG. 22. In general, the probes which form perfect matches
with the target retain the highest signal. For example, in the
first image in FIG. 22, the 12-mer probes that form perfect
matches with the wild-type (WT) target are in the first row
(top). The 12-mer probes with single base mismatches are
located in the second, third, and fourth rows and have much
lower signals. The data is also depicted graphically in FIG.
23. On each graph, the X ordinate is the position of the probe
in its row on the chip, and the Y ordinate is the signal at that
probe site after hybridization.

When a target with a different one base substitution is
hybridized, the complementary set of probes has the highest
signal (see pictures 2, 3, and 4 in FIG. 22 and graphs 2, 3,
and 4 in FIG. 23). In each case, the probe set with no
mismatches with the target has the highest signals. Within a
12-mer probe set, the signal was highest at position 6 or 7.
The graphs show that the signal difference between 12-mer
probes at the same X ordinate tended to be greatest at
positions 5 and 8 when the target and the complementary
probes formed 10 base pairs and 11 base pairs, respectively.
Because tumors often have both WT and mutant p53 genes,
mixed target populations were also hybridized to the chip, as
shown in FIG. 24. When the hybridization solution consisted
of a 1:1 mixture of WT 12-mer and a 12-mer with a
substitution in position 7 of the target, the sets of probes that
were perfectly matched to both targets showed higher signals
than the other probe sets.

For sequencing, the p53 DNA can be cloned from the
sample or directly amplified from genomic DNA by PCR. If
genomic PCR is used, then the DNA can be diluted prior to
amplification so that a single copy of the gene is amplified.
For diagnostic purposes, the genomic DNA can be isolated
from a tumor biopsy in which the tumor cells may be the
majority population. As noted above, the proportion of
tumor cells in a sample can be enriched by cryostat sectioning.
DNA can also be isolated and amplified from tumor
samples stored in paraffin blocks.

The p53 DNA in the sample can be amplified by PCR
(although other amplification methods can be used) using
3-4 primer pairs generating amplicons of <3 kbp each.
Illustrative primers of the invention for amplifying exon 5 of
the p53 gene are shown below (B is biotin; F is fluorescein).

5'-B-CACTTGTGCCCTGACTTTCAAC-3' (SEQ. ID NO:288)
5'-F-CACTTGTGCCCTGACTTTCAAC-3'
5'-ATGCAATTAACCCTCACTAAAGGGAGACACTTGTGCCCTGACTTTCAAC-3' (SEQ. ID NO:289) (has T3 promoter)
5'-B-GACCCTGGGCAACCAGCCCTGTCGT-3' (SEQ. ID NO:290)
5'-F-GACCCTGGGCAACCAGCCCTGTCGT-3'
5'-TAATACGACTCACTATAGGGAGGACCCTGGGCAACCAGCCCTGTCGT-3' (SEQ. ID NO:291) (has T7 promoter)

After PCR amplification of the target (the amplified target is
called the "amplicon") one strand of the amplicon can then
be isolated, i.e., using a biotinylated primer that allows
capture of the undesired strand on streptavidin beads.
Alternatively, asymmetric PCR can be used to generate a
single-stranded target. Another approach involves the generation
of single stranded RNA from the PCR product by
incorporating a T7 or other RNA polymerase promoter in
one of the primers. The single-stranded material can optionally
be fragmented to generate smaller nucleic acids with
less significant secondary structure than longer nucleic
acids.

The hybridization efficiency of a 10-mer probe array was
also compared with that of a 12-mer probe array. The
10-mer and 12-mer probe arrays gave comparable signals
(see graphs 1-4 in FIG. 23 and graphs 1-4 in FIG. 25).
However, the 10-mer probe sets, which are in rows 5-8 (see
images in FIG. 22), seemed to be better in this model system
than the 12-mer probe sets at resolving one target from
another, consistent with the expectation that one base mismatches
are more destabilizing for 10-mers than 12-mers.
Hybridization results within probe sets perfectly matched to
target also followed the expectation that, the more matches
the individual probe formed with the target, the higher the
signal. However, duplexes with two 3' dangles (see FIG. 23,
position 6 in graphs 1-4) have about as much signal as the
probes which are matched along their entire length (see FIG.
23, position 7, in graphs 1-4).

This illustrative model system shows that 12-mer targets
that differ by one base substitutions can be readily distinguished
from one another by the novel probe array provided
by the invention and that resolution of the different 12-mer
targets was somewhat better with the 10-mer probe sets than
with the 12-mer probe sets. The value of having several
overlapping probes hybridizing to a target demonstrates the
value of the multiple hybridization events that take place on
a DNA chip of the invention. The results also demonstrate
the feasibility of constructing a probe set to sequence the
entire 1.4 kbp protein coding region of p53 or alternatively
the 0.6 kbp of exons 5-9 containing mutation hot spots.

In one such method, fragmentation is combined with
labeling. To illustrate, degenerate 8-mers or other degenerate
short oligonucleotides are hybridized to the single-stranded
target material. In the next step, a DNA polymerase is added
with the four different dideoxynucleotides, each labeled with
a different fluorophore. Fluorophore-labeled dideoxynucleotides
are available from a variety of commercial suppliers,
such as ABI. Hybridized 8-mers are extended by a labeled
dideoxynucleotide. After an optional purification step, i.e.,
with a size exclusion column, the labeled 9-mers are hybridized
to the chip. Other methods of target fragmentation can
be employed. The single-stranded DNA can be fragmented
by partial degradation with a DNAse or partial depurination
with acid. Labeling can be accomplished in a separate step,
i.e., fluorophore-labeled nucleotides are incorporated before
the fragmentation step or a DNA binding fluorophore, such
as ethidium homodimer, is attached to the target after
fragmentation.

In one embodiment, the DNA chip has an array of 10^4 to
10^5 probes tiling across the protein coding regions of p53,
which comprise about 1200 bp; smaller arrays specific for
the 600 bp mutational hot spot region are also useful. The
probes overlap for N-2 to N-4 bases, where N is the length
of the probe in bases. N is typically 10 to 14 bases long, but
as will be seen below, probes 15 to 19 bases and longer are
also useful. Every possible single base substitution occurring
one at a time is represented in the array. The number of
unique 10-mer probes with 7 base overlaps would be about
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(120Q/3)x4xlO or about 1.6xl0 4 . To allow 3 replicates of of DNA. First, the target DNA is amplified by PCR with 

. each probe, one might have a total array size on the order of primers allowing easy ligation into a vector, which is taken 

4.8x10* probes. Of course, arrays of probes within ihc up by transformation of E. coU which in turn must be 

ranges of lO 2 to 10 4 probes are also useful for applications; cultured, typically on plates overnight. After growth of the 

for example, very large arrays of 10" or more probes are 5 bacteria, DNA is purified in a procedure that typically takes 

. useful for sequencing or sequence checking large genomic . . about 2 hours; then, the sequencing reactions are performed, 

. DNA fragments. Optionally fragmented and labeled target which takes at leasl [another hour .and I the samples .are run on 

nucleic acid hybridized to the chip is detected by a confocal ! he J? 1 ho f u ^ * e durat,0 " < *P"*«W °" he 

microscope or other imaging device. The pattern of sites len S ,h ° f thc ^ agm ".' '? J? fJT^ fl r* 

* „ .... . b . 6 , , r . ... „ present invention provides direct analysis of the PCR amph- 

"lighting up' with target is preferably analyzed with com- 10 fa materia] ift * brief -transcription and fragmentation 

puter assistance to provide the sequence of the Urget from Mvi ^ of ^ an(J ^ 

the pattern of sites produang signals. M mteresmJg clinica i application for the characterization 
The invention is illustrated below wth examples of DNA rf heterozvgous mutations wiu, DNA ch ; ps ^ « folIows . 

chips comprising very large arrays of DNAprobes to rese- Indiv j duals ^ ^ cancef mutations have a very ^ 

quence P 53 target nucleic acid in a sample. To analyze is ^ fo( . tumors by j^d;.,,.™. 

DNA from exon 5 of the p53 tumor suppressor gene, a set Abou[ 1Q% of a „ CMCer ^ haye ^ ^ 

of overlapping 17-mer probes was syntheazed on a chip. ^ for 53 Qf omef tumof Thus, 

The probes for the WT allele were synthesized so as to tile deddin 0Q a Ueatment modality a phys i cian could usc , ne 

across the entire exon with single base overlaps between method ^ DNA chJ of ^ , 0 ^ fof a 

probes. For each WT probe, a sets of 4 additional probes, 20 ^j.^ r gene mutation- 

one for each possible base substitution at position 7, were * NA ^ Rat - onal Management . 

s^mesizedandplaadmacolumnrela The t invention ^ provides DNA chips thatcan 

Exon 5 DNA was amplified by PCR with pnmers flanking be ^ fe p hysic i a ns to determine optimum therapeutic 

the exon. One of the pnmers was labeled with fluorescein; ^ fc eafl a detectfon rf b j olo ^ cally mediated 

the other pnmer was labeled with biotin. After amplification, 25 f esisUnce t0 a therapeuU - c agenl m a variety of disease states, 

the biotinylated strand was removed by binding to strepta- ^ rf such DNA ^ m ^ ^ cW ^ 

yidui beads. The fluoresceinated strand was used in hybrid- he , physiciaQS recognize heaUh C are cost savings, achieve 

aatl . 0n ' . r . . . , . .. rapid therapeutic benefits, Umit administration of ineffective 

About A of the amplified single-stranded nucleic acid (<Jue ^ res ; stance) yet toxic drugs> monitor m 

was hybndaed overnight m SxSSPE at 60° C lo the probe 30 ^ Stance, and decrease pathogen acquisition of 

chip (under a cover slip). After washmg with 6xSSPE, the rcs[sUnce a ppii ca dons include the treatment of 

chip was scanned using confocal microscopy. FIG, 26 shows HIY> othef an< j cancer, 

an image of the p53 chip hybridized to the Urget DNA. mV has infected a large and expanding number of people. 

Analysisof the uitensuydaU showed tiat 93.5%of the 1*4 ^ fflassive ^ B nditures . nfy can 

bases of exon 5 were caUed «> agreement with the WT 35 idl ^ ^ d to ^ , he j,,^^ 

sequence (see Buchman et al.. 1988, Gene 70: 245-252, rimaril due to ^ acaoa of we heterodimer ic protein (51 

uicorporated herein by reference). The miscalled bases were ^ Md fi6 j - ^ revcrse transcriptase (RT) encoded by 

from positions where probe signal intensities were tied ^ 2 ? kfe , ^ h - h em)r fate (5 _ lQ round) of 

(1.6%) and where non-WT probes had the highest signal fte RT , n fc be]ieved , 0 accounl for ue hypeniIllWb ni^ 

intensity (4.9%). RG. 27 illustrates how tie actual sequence 40 of m ^ DUC , eos ; de analogues> act, ddl, ddC, and 

was read. Gaps in the sequence of letters in the WT rows d4T commonl ^ to treat HIV infection are converted to 

correspond to control probes or sites. Positions at which auckotide analogues by sequenual phosphorylation in the 

bases are miscalled are represented by letters untalic type in lasm of ^ wherc ^^^a of lhe 

cells corresponding to probes in which the WT bases have analogue ut0 , he ^nl DNA results in termination of viral 

been substituted by other bases. 45 replication, because the 5'— 3* phosphodiester linkage can- 

As the diagram indicates, the miscalled bases are from the Mt ^ completed> However, within after 6 months to 1 year 

low intensity areas of the image, which may be due to of trMtmentj mv , ; calI mutates ^ rj gene M M to 

secondary stnjcture in the target or probes preventmg inter- beC(jme kc ble of incorporating me analogue and so 

molecular hybridization. To diminish the effects due to ^j^, , 0 Uatiaat Several ^,0^, mutetions m shown 

secondary structure, one can employ shorter targets ;(i.e., by 50 jn f orm be i ow . 
Urget fragmentation) or use more stringent hybridization 

conditions. In addition, the use of a set of probes synthesized ; j_ ■ ■ : • " ' 

by tiling across the other strand of a duplex target can also kt mutations associated with drug resistance 

provide sequence information buried in secondary structure 

in the other strand. It should be appreciated, however, that 55 ANTI- ^ 

the pattern of low intensity areas that forms as a result of vr*AL CODON «a CHANGE at CHANGE 
secondary structure in the target itself provides a means to 
identify that a specific target sequence is present in a sample. 
Other factors that may contribute to lower signal intensities 
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stabilities. 

These results demonstrate the advantages provided by the 

DNA chips of the invention to genetic analysis. As another ^ 

example, heterozygous mutations are currently sequenced N B ^ mu&liou ^uuacc to other drag* in vitro 
by an arduous process involving cloning and rcpurification 65 
of DNA. The cloning step is required, because the gel The present invention provides DNA chips for detecting 

sequencing systems are poor at resolving even a 1:1 mixture the multiple mutations in the HIV RT gene associated with 
resistance to different therapeutics. These DNA chips will
enable physicians to monitor mutations over time and to
change therapeutics if resistance develops. The DNA chip
will provide redundant confirmation of conserved HIV RT
and other gene sequences, and the probes on the chip will tile
through, with overlap, in important mutational hot spot
regions. The chip will optionally have probes that span the
entire coding region of the RT and optionally the genes for
other HIV proteins, such as coat proteins. HIV target nucleic
to gain primary structure information of the DNA target.
This format has important applications in sequencing by
hybridization, DNA diagnostics and in elucidating the thermodynamic
parameters affecting nucleic acid recognition.

Conventional DNA sequencing technology is a laborious
procedure requiring electrophoretic size separation of
labeled DNA fragments. An alternative approach, termed
Sequencing By Hybridization (SBH), has been proposed
(Lysov et al., 1988, Dokl. Akad. Nauk SSSR 303: 1508-1511;



Bains et al., 1988, J. Theor. Biol. 135: 303-307; and
Drmanac et al., 1989, Genomics 4: 114-128, incorporated
herein by reference). This method uses a set of short

acid can be isolated from blood samples (peripheral blood
lymphocytes or PBMC) and amplified by PCR, primers for
which are shown in the table below.



AMPLIFICATION OF TARGET

TARGET SIZE   PRIMER 1                                      PRIMER 2
1,742 bp      GTAGAATTCTGTTGACTCAGATTGG (SEQ ID NO:292)     GATAAGCTTGGGCCTTATCTATTCCAT (SEQ ID NO:294)
535 bp        AAATCCATACAATACTCCAGTATTTGC (SEQ ID NO:293)   ACCCATCCAAAGGAATGGAGGTTCTTTC (SEQ ID NO:295)
323 bp        Genbank #K02013, bases 1889-1908              bases 2211-2192



The HIV RT gene chips of the invention, as well as the CF,
mtDNA, and p53 DNA chips of the invention, illustrate the
diverse application of the methods and probe arrays of the
invention. The examples that follow describe methods for
preparing nucleic acid targets from samples for application
to the DNA chips of the invention and provide additional
details of the methods of the invention.

EXAMPLES

I. VLSIPS™ Technology

As noted above, the VLSIPS™ technology is described in
a number of patent publications and is preferred for making
the oligonucleotide arrays of the invention. For
completeness, a brief description of how this technology can
be used to make and screen DNA chips is provided in this
Example and the accompanying Figures. In the VLSIPS
method, light is shone through a mask to activate functional
(for oligonucleotides, typically an -OH) groups protected
with a photoremovable protecting group on a surface of a
solid support. After light activation, a nucleoside building
block, itself protected with a photoremovable protecting
group (at the 5'-OH), is coupled to the activated areas of
the support. The process can be repeated, using different
masks or mask orientations and building blocks, to prepare
very dense arrays of many different oligonucleotide probes.
The process is illustrated in FIG. 28; FIG. 29 illustrates how
the process can be used to prepare "nucleoside combinatorials"
or oligonucleotides synthesized by coupling all four
nucleosides to form dimers, trimers, etc.

New methods for the combinatorial chemical synthesis of
peptide, polycarbamate, and oligonucleotide arrays have
recently been reported (see Fodor et al., 1991, Science 251:
767-773; Cho et al., 1993, Science 261: 1303-1305; and
Southern et al., 1992, Genomics 13: 1008-1017, each of
which is incorporated herein by reference). These arrays, or
biological chips (see Fodor et al., 1993, Nature 364:
555-556, incorporated herein by reference), harbor specific
chemical compounds at precise locations in a high-density,
information rich format, and are a powerful tool for the
study of biological recognition processes. A particularly
exciting application of the array technology is in the field of
DNA sequence analysis. The hybridization pattern of a DNA
target to an array of shorter oligonucleotide probes is used

oligonucleotide probes of defined sequence to search for
complementary sequences on a longer target strand of DNA.
The hybridization pattern is used to reconstruct the target
DNA sequence. It is envisioned that hybridization analysis
of large numbers of probes can be used to sequence long
stretches of DNA. In immediate applications of this hybridization
methodology, a small number of probes can be used
to interrogate local DNA sequence.

The strategy of SBH can be illustrated by the following
example. A 12-mer target DNA sequence,
AGCCTAGCTGAA (SEQ. ID NO:296), is mixed with a
complete set of octanucleotide probes. If only perfect
complementarity is considered, five of the 65,536 octamer
probes (TCGGATCG, CGGATCGA, GGATCGAC,
GATCGACT, and ATCGACTT) will hybridize to the target.
Alignment of the overlapping sequences from the hybridizing
probes reconstructs the complement of the original
12-mer target:

TCGGATCG
 CGGATCGA
  GGATCGAC
   GATCGACT
    ATCGACTT
TCGGATCGACTT (SEQ. ID NO:297)

Hybridization methodology can be carried out by attaching
target DNA to a surface. The target is interrogated with a set
of oligonucleotide probes, one at a time (see Strezoska et al.,
1991, Proc. Natl. Acad. Sci. USA 88: 10089-10093, and
Drmanac et al., 1993, Science 260: 1649-1652, each of
which is incorporated herein by reference). This approach
can be implemented with well established methods of immobilization
and hybridization detection, but involves a large
number of manipulations. For example, to probe a sequence
utilizing a full set of octanucleotides, tens of thousands of
hybridization reactions must be performed. Alternatively,
SBH can be carried out by attaching probes to a surface in
an array format where the identity of the probes at each site
is known. The target DNA is then added to the array of
probes. The hybridization pattern determined in a single
experiment directly reveals the identity of all complementary
probes.
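The SBH example above can be followed step by step in a short sketch. The Python below is illustrative only and is not part of the described method: the function names are hypothetical, the probes are written in the same left-to-right order used in the alignment above, and the reconstruction assumes that every 7-base overlap between octamers is unique, which holds for this 12-mer but not for targets containing repeats.

```python
# Minimal sketch of the SBH example: list the perfect-match octamer probes
# for a target, then rebuild the complement by chaining overlapping probes.
# All names here are hypothetical and chosen for illustration.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written in the same order as seq."""
    return "".join(COMPLEMENT[base] for base in seq)


def hybridizing_probes(target, probe_len=8):
    """Perfect-match probes, one per window of the target."""
    comp = complement(target)
    return [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]


def reconstruct(probes):
    """Chain probes by (probe_len - 1)-base overlaps.

    Assumes each overlap occurs exactly once (no repeats in the target).
    """
    probes = list(probes)
    k = len(probes[0])
    suffixes = {p[1:] for p in probes}
    # The first probe is the one whose leading k-1 bases end no other probe.
    start = next(p for p in probes if p[:-1] not in suffixes)
    by_prefix = {p[:-1]: p for p in probes}
    seq = start
    while seq[-(k - 1):] in by_prefix:
        seq += by_prefix[seq[-(k - 1):]][-1]
    return seq


target = "AGCCTAGCTGAA"
probes = hybridizing_probes(target)
rebuilt = reconstruct(reversed(probes))  # input order does not matter
```

Here `probes` holds the five octamers listed in the text, and `rebuilt` is the 12-base complement from which the original target follows by complementing once more.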
As noted above, a preferred method of oligonucleotide
probe array synthesis involves the use of light to direct the
synthesis of oligonucleotide probes in high-density, miniaturized
arrays. Photolabile 5'-protected N-acyl-deoxynucleoside
phosphoramidites, surface linker
chemistry, and versatile combinatorial synthesis strategies
have been developed for this technology. Matrices of
spatially-defined oligonucleotide probes have been
generated, and the ability to use these arrays to identify



of the probes will generate detectable signals. Modifying the
above expression for N, one arrives at a relationship estimating
the number of detectable hybridizations (Nd) for a
DNA target of length Lt and an array of complexity C.
Assuming an average of 5 positions giving signals above
background: Nd = (1 + 5(C - 1))[Lt - (Lp - 1)].
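As a worked check of the two counting expressions, the sketch below evaluates N = Lt - (Lp - 1), which is stated elsewhere in this description, and the Nd relationship given above. The function names are hypothetical. The comment reading each perfectly matched probe as contributing one perfect match plus (C - 1) single-base variants at each of 5 positions is one interpretation of the formula, not something the text spells out.

```python
# Sketch of the hybrid-count estimates; names are illustrative assumptions.

def expected_hybrids(target_len, probe_len):
    """N = Lt - (Lp - 1): number of perfectly complementary probes."""
    return max(target_len - (probe_len - 1), 0)


def detectable_hybridizations(target_len, probe_len, complexity,
                              signal_positions=5):
    """Nd = (1 + 5(C - 1)) * [Lt - (Lp - 1)].

    Interpretation (assumed): each perfect match is accompanied by
    (C - 1) mismatched variants at each of `signal_positions` positions
    that still give signal above background.
    """
    per_window = 1 + signal_positions * (complexity - 1)
    return per_window * expected_hybrids(target_len, probe_len)


n_11mer = expected_hybrids(11, 8)               # 11-mer on an octamer array
nd_11mer = detectable_hybridizations(11, 8, 4)  # full four-base array
```

For the 11-mer on an octanucleotide array mentioned in the text, this gives N = 4; with C = 4 the same target would give Nd = 16 x 4 = 64 under the stated assumption of 5 signal-giving positions.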
Arrays of oligonucleotides can be efficiently generated by
light-directed synthesis and can be used to determine the
identity of DNA target sequences. Because combinatorial



strategies are used, the number of compounds increases
exponentially while the number of chemical coupling cycles
increases only linearly. For example, expanding the synthesis
to the complete set of 4^8 (65,536) octanucleotides will
add only four hours to the synthesis for the 16 additional
cycles. Furthermore, combinatorial synthesis strategies can
be implemented to generate arrays of any desired composition.
For example, because the entire set of dodecamers (4^12)
can be produced in 48 photolysis and coupling cycles (b^n
compounds requires b x n cycles), any subset of the dodecamers
(including any subset of shorter oligonucleotides) can be
constructed with the correct lithographic mask design in 48
or fewer chemical coupling steps. In addition, the number of
compounds in an array is limited only by the density of
synthesis sites and the overall array size. Recent experiments
have demonstrated hybridization to probes synthesized
in 25 μm sites. At this resolution, the entire set of
65,536 octanucleotides can be placed in an array measuring
0.64 cm square, and the set of 1,048,576 dodecanucleotides
requires only a 2.56 cm array.

Genome sequencing projects will ultimately be limited by
DNA sequencing technologies. Current sequencing methodologies
are highly reliant on complex procedures and require
substantial manual effort. Sequencing by hybridization has
the potential for transforming many of the manual efforts

complementary sequences has been demonstrated by
hybridizing fluorescent labeled oligonucleotides to the DNA
chips produced by the methods. The hybridization pattern
demonstrates a high degree of base specificity and reveals
the sequence of oligonucleotide targets.

The basic strategy for light-directed oligonucleotide synthesis
(1) is outlined in FIG. 28. The surface of a solid
support modified with photolabile protecting groups (X) is
illuminated through a photolithographic mask, yielding
reactive hydroxyl groups in the illuminated regions. A
3'-O-phosphoramidite activated deoxynucleoside (protected
at the 5'-hydroxyl with a photolabile group) is then presented
to the surface and coupling occurs at sites that were exposed
to light. Following capping, and oxidation, the substrate is
rinsed and the surface illuminated through a second mask, to
expose additional hydroxyl groups for coupling. A second
5'-protected, 3'-O-phosphoramidite activated deoxynucleoside
is presented to the surface. The selective photodeprotection
and coupling cycles are repeated until the desired set
of products is obtained.
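The scaling argument in this passage (b^n compounds from b x n coupling cycles, and array dimensions set by the synthesis site size) can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and it assumes square synthesis sites packed edge to edge in a square array, which the text implies but does not state.

```python
import math

# Sketch of the combinatorial scaling arithmetic; names are assumptions.

def compound_count(bases, length):
    """Number of distinct oligonucleotides: bases ** length."""
    return bases ** length


def coupling_cycles(bases, length):
    """Photolysis/coupling cycles for the complete set: bases * length."""
    return bases * length


def array_side_cm(n_probes, site_um=25.0):
    """Side of a square array holding n_probes sites of site_um per side."""
    sites_per_side = math.isqrt(n_probes)
    if sites_per_side ** 2 < n_probes:
        sites_per_side += 1  # round up when n_probes is not a perfect square
    return sites_per_side * site_um / 10_000  # micrometers to centimeters


octamers = compound_count(4, 8)
extra_cycles = coupling_cycles(4, 8) - coupling_cycles(4, 4)
dodecamer_cycles = coupling_cycles(4, 12)
side_65536 = array_side_cm(65_536)
side_1048576 = array_side_cm(1_048_576)
```

With 25 μm sites this gives 0.64 cm for 65,536 probes and 2.56 cm for 1,048,576 probes, matching the figures quoted above. The sketch takes the probe counts as given rather than deriving them from an oligonucleotide length; 1,048,576 is 4**10, while `compound_count(4, 12)` is 16,777,216.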

Light directed chemical synthesis lends itself to highly
efficient synthesis strategies which will generate a maximum
number of compounds in a minimum number of chemical
steps. For example, the complete set of 4^n polynucleotides
(length n), or any subset of this set can be produced in only
4 x n chemical steps. See FIG. 29. The patterns of illumination
and the order of chemical reactants ultimately define the
products and their locations. Because photolithography is
used, the process can be miniaturized to generate high-density
arrays of oligonucleotide probes. For an example of
the nomenclature useful for describing such arrays, an array
containing all possible octanucleotides of dA and dT is
written as (A+T)^8. Expansion of this polynomial reveals the
identity of all 256 octanucleotide probes from AAAAAAAA
to TTTTTTTT. A DNA array composed of complete sets of
dinucleotides is referred to as having a complexity of 2. The
array given by (A+T+C+G)^8 is the full 65,536 octanucleotide
array of complexity four.

To carry out hybridization of DNA targets to the probe
arrays, the arrays are mounted in a thermostatically controlled
hybridization chamber. Fluorescein labeled DNA
targets are injected into the chamber, and hybridization is
allowed to proceed for ½ to 2 hours. The surface of the
matrix is scanned in an epifluorescence microscope (Zeiss
Axioscop 20) equipped with photon counting electronics
using 50-100 μW of 488 nm excitation from an Argon ion

into more efficient and automated formats. Light-directed
synthesis is an efficient means for large scale production of
miniaturized arrays for SBH. The oligonucleotide arrays are
not limited to primary sequencing applications. Because
single base changes cause multiple changes in the hybridization
pattern, the oligonucleotide arrays provide a powerful
means to check the accuracy of previously elucidated
DNA sequence, or to scan for changes within a sequence. In
the case of octanucleotides, a single base change in the target
DNA results in the loss of eight complements, and generates
eight new complements. Matching of hybridization patterns
may be useful in resolving sequencing ambiguities from
standard gel techniques, or for rapidly detecting DNA mutational
events. The potentially very high information content
of light-directed oligonucleotide arrays will change genetic
diagnostic testing. Sequence comparisons of hundreds to
thousands of different genes will be assayed simultaneously
instead of the current one, or few at a time format. Custom
arrays can also be constructed to contain genetic markers for
the rapid identification of a wide variety of pathogenic
organisms.
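The statement that, for octanucleotides, a single base change in the target removes eight complements and creates eight new ones can be checked directly. The sketch below is illustrative only: the 24-base sequence is a hypothetical target invented for this example (it is not a sequence from this description), the function names are assumptions, and the count of eight holds when the changed base lies at least seven bases from either end and the affected windows do not coincide with other windows in the target.

```python
# Sketch: compare the sets of perfect-match octamer probes for a target
# and a single-base variant. Sequence and names are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary_probes(target, probe_len=8):
    """Set of probes perfectly complementary to some window of target."""
    comp = "".join(COMPLEMENT[base] for base in target)
    return {comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)}


def pattern_change(wild_type, mutant, probe_len=8):
    """Return (lost, gained) probe sets when wild_type becomes mutant."""
    wt = complementary_probes(wild_type, probe_len)
    mt = complementary_probes(mutant, probe_len)
    return wt - mt, mt - wt


wild_type = "AGCCTAGCTGAATTCGGATCCATG"          # hypothetical 24-mer
mutant = wild_type[:12] + "C" + wild_type[13:]  # T to C at index 12
lost, gained = pattern_change(wild_type, mutant)
```

For this target both `lost` and `gained` contain eight probes, one for each octamer window that spans the changed base. A change nearer an end of the target would affect fewer windows.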



Oligonucleotide arrays can also be applied to study the
sequence specificity of RNA or protein-DNA interactions.
Experiments can be designed to elucidate specificity rules of
non Watson-Crick oligonucleotide structures or to investigate
the use of novel synthetic nucleoside analogs for
antisense or triple helix applications. Suitably protected
RNA monomers may be employed for RNA synthesis. The
oligonucleotide arrays should find broad application deducing
the thermodynamic and kinetic rules governing formation
and stability of oligonucleotide complexes.

Other than the use of photoremovable protecting groups,
the nucleoside coupling chemistry is very similar to that

probes on the array. For example, for an 11-mer hybridized
to an octanucleotide array, N=4. Hybridizations with mismatches
at positions that are 2 to 3 residues from either end



laser (Spectra Physics model 2020). All measurements are
acquired with the target solution in contact with the probe
matrix. Photon counts are stored and image files are presented
after conversion to an eight bit image format. See
FIG. 33.

When hybridizing a DNA target to an oligonucleotide
array, N = Lt - (Lp - 1) complementary hybrids are expected,
where N is the number of hybrids, Lt is the length of the
DNA target, and Lp is the length of the oligonucleotide
used routinely today for oligonucleotide synthesis. FIG. 30
shows the deprotection, coupling, and oxidation steps of a
solid phase DNA synthesis method. FIG. 31 shows an
illustrative synthesis route for the nucleoside building blocks
used in the method. FIG. 32 shows a preferred photoremovable
protecting group, MeNPOC, and how to prepare the
group in active form. The procedures described below show
how to prepare these reagents. The nucleoside building
blocks are 5'-MeNPOC-THYMIDINE-3'-OCEP;
5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYCYTIDINE-3'-OCEP;
5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYGUANOSINE-3'-OCEP;
and 5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYADENOSINE-3'-OCEP.

A. Preparation of 4,5-methylenedioxy-2-nitroacetophenone

minimum volume of CH2Cl2 or THF (~175 ml) and then
precipitating it by slowly adding hexane (1000 ml) while
stirring (yield 51 g; 80% overall). It can also be recrystallized
(e.g., toluene-hexane), but this reduces the yield.
C. Preparation of 1-(4,5-methylenedioxy-2-nitrophenyl)ethyl
chloroformate (MeNPOC-Cl)

(Reaction scheme not reproduced; solvent label: Toluene/THF)





A solution of 50 g (0.305 mole) 3,4-methylenedioxyacetophenone
(Aldrich) in 200 mL glacial
acetic acid was added dropwise over 30 minutes to 700 mL
of cold (2-4° C.) 70% HNO3 with stirring (NOTE: the
reaction will overheat without external cooling from an ice
bath, which can be dangerous and lead to side products. At
temperatures below 0° C., however, the reaction can be
sluggish. A temperature of 3°-5° C. seems to be optimal).
The mixture was left stirring for another 60 minutes at 3°-5°
C., and then allowed to approach ambient temperature.
Analysis by TLC (25% EtOAc in hexane) indicated complete
conversion of the starting material within 1-2 hr. When
the reaction was complete, the mixture was poured into ~3
liters of crushed ice, and the resulting yellow solid was
filtered off, washed with water and then suction-dried. Yield
~53 g (84%), used without further purification.
B. Preparation of 1-(4,5-Methylenedioxy-2-nitrophenyl)ethanol



Phosgene (500 mL of 20% w/v in toluene from Fluka: 965
mmole; 4 eq.) was added slowly to a cold, stirring solution
of 50 g (237 mmole; 1 eq.) of 1-(4,5-methylenedioxy-2-nitrophenyl)ethanol
in 400 mL dry THF. The solution was
stirred overnight at ambient temperature, at which point TLC
(20% Et2O/hexane) indicated >95% conversion. The mixture
was evaporated (an oil-less pump with downstream
aqueous NaOH trap is recommended to remove the excess
phosgene) to afford a viscous brown oil. Purification was
effected by flash chromatography on a short (9x13 cm)
column of silica gel eluted with 20% Et2O/hexane. Typically
55 g (85%) of the solid yellow MeNPOC-Cl is obtained by
this procedure. The crude material has also been recrystallized
in 2-3 crops from 1:1 ether/hexane. On this scale, ~100
ml is used for the first crop, with a few percent THF added
to aid dissolution, and then cooling overnight at -20° C. (this
procedure has not been optimized). The product should be
stored desiccated at -20° C.
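The molar amounts and percentage yields quoted in preparations A to C can be cross-checked with a short calculation. The sketch below is illustrative only: the molecular formulas are inferred here from the compound names (they are not given in the text), the atomic masses are rounded standard values, and the function names are hypothetical. Yields are computed mole for mole, since each step is a 1:1 conversion.

```python
# Sketch of the yield arithmetic for preparations A-C.
# Formulas are inferred from the compound names (an assumption).

ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}

ACETOPHENONE = {"C": 9, "H": 8, "O": 3}            # 3,4-methylenedioxyacetophenone
NITRO_KETONE = {"C": 9, "H": 7, "N": 1, "O": 5}    # 4,5-methylenedioxy-2-nitroacetophenone
ALCOHOL = {"C": 9, "H": 9, "N": 1, "O": 5}         # 1-(4,5-methylenedioxy-2-nitrophenyl)ethanol
CHLOROFORMATE = {"C": 10, "H": 8, "Cl": 1, "N": 1, "O": 6}  # MeNPOC-Cl


def molar_mass(formula):
    """Molar mass in g/mol from an element-count dictionary."""
    return sum(ATOMIC_MASS[element] * n for element, n in formula.items())


def moles(grams, formula):
    return grams / molar_mass(formula)


def percent_yield(product_g, product, start_g, start):
    """Percent yield for a 1:1 conversion of start into product."""
    return 100.0 * moles(product_g, product) / moles(start_g, start)


yield_a = percent_yield(53, NITRO_KETONE, 50, ACETOPHENONE)   # step A
overall_b = percent_yield(51, ALCOHOL, 50, ACETOPHENONE)      # A + B overall
yield_c = percent_yield(55, CHLOROFORMATE, 50, ALCOHOL)       # step C
phosgene_eq = 965 / (1000 * moles(50, ALCOHOL))               # equivalents used
```

Under these assumptions the results come out near 83%, 79% and 85%, in line with the quoted 84%, "80% overall" and 85%, and the phosgene charge works out to roughly 4 equivalents. The "80% overall" figure appears to be relative to the 50 g of acetophenone used in step A.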

D. Synthesis of 5'-MeNPOC-2'-DEOXYNUCLEOSIDE-3'-(N,N-DIISOPROPYL
2-CYANOETHYL PHOSPHORAMIDITES)
(1) 5'-MeNPOC-Nucleosides

(Reaction scheme not reproduced; solvent label: Pyridine)

Sodium borohydride (10 g; 0.27 mol) was added slowly
to a cold, stirring suspension of 53 g (0.25 mol) of 4,5-methylenedioxy-2-nitroacetophenone
in 400 mL methanol.
The temperature was kept below 10° C. by slow addition of
the NaBH4 and external cooling with an ice bath. Stirring
was continued at ambient temperature for another two hours,
at which time TLC (CH2Cl2) indicated complete conversion
of the ketone. The mixture was poured into one liter of
ice-water and the resulting suspension was neutralized with
ammonium chloride and then extracted three times with 400
mL CH2Cl2 or EtOAc (the product can be collected by
filtration and washed at this point, but it is somewhat soluble
in water and this results in a yield of only ~60%). The
combined organic extracts were washed with brine, then
dried with MgSO4 and evaporated. The crude product was
purified from the main byproduct by dissolving it in a




Base = THYMIDINE (T); N-4-ISOBUTYRYL 
2'-DEOXYCYTIDINE (ibu-dC); N-2-PHENOXYACETYL 
2'-DEOXYGUANOSINE (PAC-dG); and N-6- 
PHENOXYACETYL 2'-DEOXYADENOSINE (PAC-dA) 

All four of the 5'-MeNPOC nucleosides were prepared 
from the base-protected 2'-deoxynucleosides by the following 
procedure. The protected 2'-deoxynucleoside (90 
mmole) was dried by co-evaporating twice with 250 mL 
anhydrous pyridine. The nucleoside was then dissolved in 
300 mL anhydrous pyridine (or 1:1 pyridine/DMF, for the 
dG(PAC) nucleoside) under argon and cooled to ~2° C. in an 
ice bath. A solution of 24.6 g (90 mmole) MeNPOC-Cl in 
100 mL dry THF was then added with stirring over 30 
minutes. The ice bath was removed, and the solution allowed 
to stir overnight at room temperature (TLC: 5-10% MeOH 
in CH2Cl2; two diastereomers). After evaporating the solvents 
under vacuum, the crude material was taken up in 250 
mL ethyl acetate and extracted with saturated aqueous 
NaHCO3 and brine. The organic phase was then dried over 
Na2SO4, filtered and evaporated to obtain a yellow foam. 
The crude products were finally purified by flash chromatography 
(9x30 cm silica gel column eluted with a stepped 
gradient of 2%-6% MeOH in CH2Cl2). Yields of the purified 
diastereomeric mixtures are in the range of 65-75%. 
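
The mole amounts, equivalents and yield quoted for the MeNPOC-Cl preparation and for the nucleoside protection above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the molecular formulas are inferred from the compound names (the text gives only masses and mole amounts), the atomic weights are rounded standard values, and the helper names `molar_mass` and `mmol` are hypothetical.

```python
# Rounded standard atomic weights (g/mol).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}

# Assumed formulas, inferred from the compound names in the text.
ALCOHOL = {"C": 9, "H": 9, "N": 1, "O": 5}              # 1-(4,5-methylenedioxy-2-nitrophenyl)ethanol
MENPOC_CL = {"C": 10, "H": 8, "N": 1, "O": 6, "Cl": 1}  # its chloroformate, MeNPOC-Cl


def molar_mass(formula):
    """Sum atomic weights over an element -> count mapping."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


def mmol(mass_g, formula):
    """Convert a mass in grams to millimoles."""
    return 1000.0 * mass_g / molar_mass(formula)


alcohol_mmol = mmol(50.0, ALCOHOL)                 # text quotes 237 mmole
phosgene_equiv = 965.0 / alcohol_mmol              # text quotes 4 eq.
theoretical_g = alcohol_mmol / 1000.0 * molar_mass(MENPOC_CL)
yield_pct = 100.0 * 55.0 / theoretical_g           # text quotes 85%
menpoc_mmol = mmol(24.6, MENPOC_CL)                # text quotes 90 mmole

print(f"alcohol: {alcohol_mmol:.1f} mmol, phosgene: {phosgene_equiv:.2f} eq.")
print(f"MeNPOC-Cl yield: {yield_pct:.1f}%, 24.6 g MeNPOC-Cl = {menpoc_mmol:.1f} mmol")
```

With these assumed formulas the computed figures round to the values quoted in the text, which suggests the stated quantities are internally consistent.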

(2) 5'-MeNPOC-2'-DEOXYNUCLEOSIDE-3'-(N,N- 
DIISOPROPYL 2-CYANOETHYL 
PHOSPHORAMIDITES) 

[Reaction scheme not reproduced; surviving labels: MeNPOC-O, Base, amiditing reagent, DIEA/DCM] 

The four deoxynucleosides were phosphitylated using 
either 2-cyanoethyl-N,N-diisopropyl 
chlorophosphoramidite, or 2-cyanoethyl-N,N,N',N'- 
tetraisopropylphosphorodiamidite. The following is a typical 
procedure. Add 16.6 g (17.4 ml; 55 mmole) of 
2-cyanoethyl-N,N,N',N'-tetraisopropylphosphorodiamidite 
to a solution of 50 mmole 5'-MeNPOC-nucleoside and 4.3 
g (25 mmole) diisopropylammonium tetrazolide in 250 mL 
dry CH2Cl2 under argon at ambient temperature. Continue 
stirring for 4-16 hours (reaction monitored by TLC: 
45:45:10 hexane/CH2Cl2/Et3N). Wash the organic phase 
with saturated aqueous NaHCO3 and brine, then dry over 
Na2SO4, and evaporate to dryness. Purify the crude amidite 
by flash chromatography (9x25 cm silica gel column eluted 
with hexane/CH2Cl2/TEA, 45:45:10 for A, C, T; or 0:90:10 
for G). The yield of purified amidite is about 90%. 

For products in the 200 to 1000 bp size range, check 2 µl 
of the reaction on a 1.5% 0.5×TBE agarose gel using an 
appropriate size standard (phiX174 cut with HaeIII is 
convenient). The PCR reaction should yield several picomoles 
of product. It is helpful to include a negative control 
(i.e., 1 µl TE instead of genomic DNA) to check for possible 
contamination. To avoid contamination, keep PCR products 
from previous experiments away from later reactions, using 
filter tips as appropriate. Using a set of working solutions 
and storing master solutions separately is helpful, so long as 
one does not contaminate the master stock solutions. 

For simple amplifications of short fragments from 
genomic DNA it is, in general, unnecessary to optimize 
Mg2+ concentrations. A good procedure is the following: 
make a master mix minus enzyme; dispense the genomic 
DNA samples to individual tubes or reaction wells; add 
enzyme to the master mix; and mix and dispense the master 
solution to each well, using a new filter tip each time. 

2) PURIFICATION 

Removal of unincorporated nucleotides and primers from 
PCR samples can be accomplished using the Promega 
Magic PCR Preps DNA purification kit. One can purify the 
whole sample, following the instructions supplied with the 
kit (proceed from section IIIB, 'Sample preparation for 
direct purification from PCR reactions'). After elution of the 
PCR product in 50 µl of TE or H2O, one centrifuges the 
eluate for 20 sec at 12,000 rpm in a microfuge and carefully 
transfers 45 µl to a new microfuge tube, avoiding any visible 
pellet. Resin is sometimes carried over during the elution 
step. This transfer prevents accidental contamination of the 
linear amplification reaction with 'Magic PCR' resin. Other 
methods, e.g. size exclusion chromatography, may also be 
used. 

3) LINEAR AMPLIFICATION 

In a 0.2 mL thin-wall PCR tube mix: 4 µl purified PCR 
product; 2 µl primer (10 pmol/µl); [illegible] 
dNTPs (2 mM dA, dC, dG, 0.1 mM dT); [illegible] dUTP; 
1 µl 1 mM fluorescein dUTP (Amersham RPN 2121); 1 U 
Taq polymerase (Perkin Elmer, 5 U/µl); and add H2O to 40 
µl. Conduct 40 cycles (92° C. 30 sec, 55° C. 30 sec, 72° C. 
90 sec) of PCR. These conditions have been used to amplify 
a 300 nucleotide mitochondrial DNA fragment but are 
generally applicable. Even in the absence of a visible 
product band on an agarose gel, there should still be enough 
product to give an easily detectable hybridization signal. If 
one is not treating the DNA with uracil DNA glycosylase 
(see Section 4), dUTP can be omitted from the reaction. 

4) FRAGMENTATION 

Purify the linear amplification product using the Promega 
Magic PCR Preps DNA purification kit, as per Section 2 
above. In a 0.2 mL thin-wall PCR tube mix: 40 µl purified 
labeled DNA; 4 µl 10×PCR buffer; and 0.5 µl uracil DNA 
glycosylase (BRL 1 U/µl). Incubate the mixture 15 min at 
37° C., then 10 min at 97° C.; store at -20° C. until ready 
to use. 
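
The master-mix procedure described above (make a mix minus enzyme, dispense the genomic DNA, add enzyme, dispense) amounts to scaling each per-reaction volume by the number of tubes. The following is a minimal sketch of that bookkeeping, assuming the 100 µl PCR recipe given in this section (10 µl of each primer, 10×buffer and 2 mM dNTPs, 1 µl genomic DNA, 2.5 U Taq from a 5 U/µl stock); the final volume is read from a partly illegible value and is assumed here from the 10 µl of 10×buffer. The 10% pipetting overage and the function name `master_mix` are illustrative choices, not part of the original protocol.

```python
# Per-reaction volumes in microliters for a 100 µl PCR. Genomic DNA and
# enzyme are handled separately, as the procedure describes.
MIX_MINUS_ENZYME_UL = {
    "forward primer (10 pmol/µl)": 10.0,
    "reverse primer (10 pmol/µl)": 10.0,
    "10x PCR buffer": 10.0,
    "2 mM dNTPs": 10.0,
}
GENOMIC_DNA_UL = 1.0            # dispensed into each tube first
TAQ_UNITS_PER_REACTION = 2.5
TAQ_STOCK_UNITS_PER_UL = 5.0
FINAL_VOLUME_UL = 100.0         # assumed final reaction volume


def master_mix(n_reactions, overage=0.10):
    """Scale the recipe to n_reactions plus a pipetting overage (assumed 10%)."""
    if n_reactions < 1:
        raise ValueError("need at least one reaction")
    scale = n_reactions * (1.0 + overage)
    taq_ul = TAQ_UNITS_PER_REACTION / TAQ_STOCK_UNITS_PER_UL
    # Water makes up whatever volume the other components leave free.
    water_ul = (FINAL_VOLUME_UL - GENOMIC_DNA_UL - taq_ul
                - sum(MIX_MINUS_ENZYME_UL.values()))
    mix = {name: vol * scale for name, vol in MIX_MINUS_ENZYME_UL.items()}
    mix["H2O"] = water_ul * scale
    return {
        "mix_minus_enzyme_ul": mix,
        "taq_to_add_last_ul": taq_ul * scale,
        "dispense_per_tube_ul": FINAL_VOLUME_UL - GENOMIC_DNA_UL,
    }


plan = master_mix(8)
for name, vol in plan["mix_minus_enzyme_ul"].items():
    print(f"{name:30s} {vol:7.1f} µl")
print(f"Taq, added last: {plan['taq_to_add_last_ul']:.1f} µl")
print(f"Dispense {plan['dispense_per_tube_ul']:.0f} µl into each tube")
```

The same scaling applies to the 40 µl recipes in this section if the per-reaction table and final volume are changed accordingly.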

II. PREPARATION OF LABELED DNA/ 
HYBRIDIZATION TO ARRAY 
1) PCR 

PCR amplification reactions are typically conducted in a 
mixture composed of, per reaction: 1 µl genomic DNA; 10 µl 
each primer (10 pmol/µl stocks); 10 µl 10×PCR buffer (100 
mM Tris.Cl pH 8.5, 500 mM KCl, 15 mM MgCl2); 10 µl 2 
mM dNTPs (made from 100 mM dNTP stocks); 2.5 U Taq 
polymerase (Perkin Elmer AmpliTaq™, 5 U/µl); and H2O to 
100 µl. The cycling conditions are usually 40 cycles (94° C. 
45 sec, 55° C. 30 sec, 72° C. 60 sec) but may need to be 
varied [...]. These conditions are for 0.2 mL thin wall tubes 
in a Perkin Elmer 9600 thermocycler. See Perkin Elmer 
1992/93 catalogue for 9600 cycle time information. Target, 
primer length and sequence composition, among other 
factors, may also affect parameters. 

5) HYBRIDIZATION, SCANNING & STRIPPING 
A blank scan of the slide in hybridization buffer only is 
helpful to check that the slide is ready for use. The buffer is 
removed from the flow cell and replaced with 1 mL of 
(fragmented) DNA in hybridization buffer and mixed well. 
The scan is performed in the presence of the labeled target. 
FIG. 33 illustrates an illustrative detection system for scanning 
[...] a hybridization temperature of 25° C. yields a very clear 
signal, usually in at least 30 min to two hours, but it may be 
desirable to hybridize longer, i.e., overnight. Using a laser 
power of 50 µW and 50 µm pixels, one should obtain 
maximum counts in the range of hundreds to low thousands/ 
pixel for a new slide. When finished, the slide can be 
stripped using 50% formamide, rinsing well in deionized 
H2O, blowing dry, and storing at room temperature. 

III. PREPARATION OF LABELED RNA 
/HYBRIDIZATION TO ARRAY 
1) TAGGED PRIMERS 

The primers used to amplify the target nucleic acid should 
have promoter sequences if one desires to produce RNA 
from the amplified nucleic acid. Suitable promoter 
sequences are shown below and include: 
(1) the T3 promoter sequence: 
5'-CGGAATTAACCCTCACTAAAGG (SEQ. ID NO:298) 
5'-AATTAACCCTCACTAAAGGGAG; (SEQ. ID NO:299) 
(2) the T7 promoter sequence: 
5' TAATACGACTCACTATAGGGAG; (SEQ. ID NO:300) 
and (3) the SP6 promoter sequence: 
5' ATTTAGGTGACACTATAGAA. (SEQ. ID NO:301) 
The desired promoter sequence is added to the 5' end of the 
PCR primer. It is convenient to add a different promoter to 
each primer of a PCR primer pair so that either strand may 
be transcribed from a single PCR product. 

Synthesize PCR primers so as to leave the DMT group on. 
DMT-on purification is unnecessary for PCR but appears to 
be important for transcription. Add 25 µl 0.5M NaOH to 
collection vial prior to collection of oligonucleotide to keep 
the DMT group on. Deprotect using standard chemistry; 
55° C. overnight is convenient. 

HPLC purification is accomplished by drying down the 
oligonucleotides, resuspending in 1 mL 0.1M TEAA (dilute 
2.0M stock in deionized water, filter through 0.2 micron 
filter) and filter through 0.2 micron filter. Load 0.5 mL on 
reverse phase HPLC (column can be a Hamilton PRP-1 
semi-prep, #79426). The gradient is 0-50% CH3CN over 
25 min (program 0.2 µmol.prep.0-50, 25 min). Pool the 
desired fractions, dry down, resuspend in 200 µl 80% HAc. 
30 min RT. Add 200 µl EtOH; dry down. Resuspend in 200 
µl H2O, plus 20 µl NaAc pH5.5, 600 µl EtOH. Leave 10 min 
on ice; centrifuge 12,000 rpm for 10 min in microfuge. Pour 
off supernatant. Rinse pellet with 1 mL EtOH, dry, resuspend 
in 200 µl H2O. Dry, resuspend in 200 µl TE. Measure A260, 
prepare a 10 pmol/µl solution in TE (10 mM Tris.Cl pH 8.0, 
0.1 mM EDTA). Following HPLC purification of a 42 mer, 
a yield in the vicinity of 15 nmol from a 0.2 µmol scale 
synthesis is typical. 

2) GENOMIC DNA PREPARATION 
For obtaining genomic DNA from human hair, one can 
extract as few as 5 hairs, including hair roots. On a clean and 
sterile surface, one places the hair on a piece of parafilm, and 
after wiping a new razor blade with EtOH and cutting off the 
roots, the roots are transferred to a 1.5 mL microfuge tube 
using a pair of Millipore forceps cleaned with EtOH. Add 
500 µl (10 mM Tris.Cl pH8.0, 10 mM EDTA, 100 mM NaCl, 
2% (w/v) SDS, 40 mM DTT, filter sterilized) to the sample. 
Add 1.25 µl 20 mg/ml proteinase K (Boehringer). Incubate at 
55° C. for 2 hours, vortexing once or twice. Perform 2×0.5 
mL 1:1 phenol:CHCl3 extractions. After each extraction, 
centrifuge 12,000 rpm 5 min in a microfuge and recover 0.4 
mL supernatant. Add 35 µl NaAc pH5.2 plus 1 mL EtOH. 
Place sample on ice 45 min; then centrifuge 12,000 rpm 30 
min, rinse, air dry 30 min, and resuspend in 100 µl TE. 

3) PCR 
PCR is performed in a mixture containing, per reaction: 1 
µl genomic DNA; 4 µl each primer (10 pmol/µl stocks); 4 µl 
10×PCR buffer (100 mM Tris.Cl pH8.5, 500 mM KCl, 15 
mM MgCl2); 4 µl 2 mM dNTPs (made from 100 mM dNTP 
stocks); 1 U Taq polymerase (Perkin Elmer, 5 U/µl); H2O to 
40 µl. About 40 cycles (94° C. 30 sec, 55° C. 30 sec, 72° C. 
30 sec) are performed, but cycling conditions may need to 
be varied. These conditions are for 0.2 mL thin wall tubes in 
Perkin Elmer 9600. For products in the 200 to 1000 bp size 
range, check 2 µl of the reaction on a 1.5% 0.5×TBE agarose 
gel using an appropriate size standard. For larger or smaller 
volumes (20-100 µl), one can use the same amount of 
genomic DNA but adjust the other ingredients accordingly. 

4) IN VITRO TRANSCRIPTION 
Mix: 3 µl PCR product; 4 µl 5×buffer; 2 µl DTT; 2.4 µl 10 
mM rNTPs (100 mM solutions from Pharmacia); 0.48 µl 10 
mM fluorescein-UTP (Fluorescein-12-UTP, 10 mM 
solution, from Boehringer Mannheim); 0.5 µl RNA polymerase 
(Promega T3 or T7 RNA polymerase); and add H2O 
to 20 µl. Incubate at 37° C. for 3 h. Check 2 µl of the reaction 
on a 1.5% 0.5×TBE agarose gel using a size standard. 
5×buffer is 200 mM Tris pH 7.5, 30 mM MgCl2, 10 mM 
spermidine, 50 mM NaCl, and 100 mM DTT (supplied with 
enzyme). The PCR product needs no purification and can be 
added directly to the transcription mixture. A 20 µl reaction 
is suggested for an initial test experiment and hybridization; 
a 100 µl reaction is considered "preparative" scale (the 
reaction can be scaled up to obtain more target). The amount 
of PCR product to add is variable; typically a PCR reaction 
will yield several picomoles of DNA. If the PCR reaction 
does not produce that much target, then one should increase 
the amount of DNA added to the transcription reaction (as 
well as optimize the PCR). The ratio of fluorescein-UTP to 
UTP suggested above is 1:5, but ratios from 1:3 to 1:10 all 
work well. One can also label with biotin-UTP and detect 
with streptavidin-FITC to obtain similar results as with 
fluorescein-UTP detection. 

For nondenaturing agarose gel electrophoresis of RNA, 
note that the RNA band will normally migrate somewhat 
faster than the DNA template band, although sometimes the 
two bands will comigrate. The temperature of the gel can 
affect the migration of the RNA band. The RNA produced 
from in vitro transcription is quite stable and can be stored 
for months (at least) at -20° C. without any evidence of 
degradation. It can be stored in unsterilized 6×SSPE 0.1% 
Triton X-100 at -20° C. for days (at least) and reused twice 
(at least) for hybridization, without taking any special precautions 
in preparation or during use. RNase contamination 
should of course be avoided. When extracting RNA from 
cells, it is preferable to work very rapidly and to use strongly 
denaturing conditions. Avoid using glassware previously 
contaminated with RNases. Use of new disposable plasticware 
(not necessarily sterilized) is preferred, as new 
plastic tubes, tips, etc., are essentially RNase free. Treatment 
with DEPC or autoclaving is typically not necessary. 

5) FRAGMENTATION 
In a 0.2 mL thin-wall PCR tube mix: 18 µl RNA (direct 
from transcription reaction; no purification required); 18 µl 
H2O; and 4 µl 1M Tris.Cl pH9.0. Incubate at 99.9° C. for 60 
min. Add to 1 mL hybridization buffer and store at -20° C. 
until ready to use. The alkaline hydrolysis step is very 
reliable. The hydrolysed target can be stored at -20° C. in 
6×SSPE/0.1% Triton X-100 for at least several days prior to 
use and can also be reused. 

6) HYBRIDIZATION, SCANNING, & STRIPPING 
A blank scan of the slide in hybridization buffer only is 
helpful to check that the slide is ready for use. The buffer is 
removed from the flow cell and replaced with 1 mL of 
(hydrolysed) RNA in hybridization buffer and mixed well. 
Incubate for 15-30 min at 18° C. Remove the hybridization 
solution, which can be saved for subsequent experiments. 
Rinse the flow cell 4-5 times with fresh changes of 6×SSPE/ 
0.1% Triton X-100, equilibrated to 18° C. The rinses can be 
ARRAYS OF MODIFIED NUCLEIC ACID 
PROBES AND METHODS OF USE 

CROSS-REFERENCE TO RELATED 
APPLICATION 

This application is a continuation-in-part of U.S. Ser. No. 
08/440,742 filed May 10, 1995 abandoned, which is a 
continuation-in-part of PCT application (designating the 
United States) SN PCT/US94/12305 filed Oct. 26, 1994, 
which is a continuation-in-part of U.S. Ser. No. 08/284,064 
filed Aug. 2, 1994 abandoned, which is a continuation-in- 
part of U.S. Ser. No. 08/143,312 filed Oct. 26, 1993 
abandoned, each of which is incorporated herein by reference 
in its entirety for all purposes. 

FIELD OF THE INVENTION 

The present invention provides probes comprised of 
nucleotide analogues immobilized in arrays on solid substrates 
for analyzing molecular interactions of biological 
interest, and target nucleic acids comprised of nucleotide 
analogues. The invention therefore relates to the molecular 
interaction of polymers immobilized on solid substrates 
including related chemistry, biology, and medical diagnostic 
uses. 

BACKGROUND OF THE INVENTION 

The development of very large scale immobilized polymer 
synthesis (VLSIPS™) technology provides pioneering 
methods for arranging large numbers of oligonucleotide 
probes in very small arrays. See, U.S. application Ser. No. 
07/805,727 now U.S. Pat. No. 5,424,186 and PCT patent 
publication Nos. WO 90/15070 and 92/10092, each of which 
is incorporated herein by reference for all purposes. U.S. 
patent application Ser. No. 08/082,937, filed Jun. 25, 1993, 
incorporated herein for all purposes, describes methods 
for making arrays of oligonucleotide probes that are used, 
e.g., to determine the complete sequence of a target nucleic 
acid and/or to detect the presence of a nucleic acid with a 
specified sequence. 

VLSIPS™ technology provides an efficient means for 
large scale production of miniaturized oligonucleotide 
arrays for sequencing by hybridization (SBH), diagnostic 
[illegible] somatically acquired genetic 
[illegible] 
interactions. 

SUMMARY OF THE INVENTION 

The present invention provides arrays of oligonucleotide 
analogues attached to solid substrates. Oligonucleotide analogues 
have different hybridization properties than oligonucleotides 
based upon naturally occurring nucleotides. By 
incorporating oligonucleotide analogues into the arrays of 
the invention, hybridization to a target nucleic acid is 
[illegible] oligonucleotide analogue arrays have virtually any 
[illegible] 
array in a given application. In one group of [illegible] 
the array has from 10 up to 100 oligonucleotide analogue 
[illegible] 
between 100 and 10,000 members, and in yet other embodiments 
[illegible] 
density of more than 100 members at known locations per 
cm², preferably, more than 1000 members per cm². 
In some embodiments, the arrays have a density of more 
than 10,000 members per cm². 

[illegible] is constructed 
includes any material upon which oligonucleotide analogues 
[illegible] a defined relationship to one another, such as 
[illegible]. Especially preferred oligonucleotide 
analogues of the array are between about 5 and about 
20 nucleotides, nucleotide analogues or a mixture thereof in 
[illegible] 

In one group of embodiments, nucleoside analogues 
incorporated into the oligonucleotide analogues of the array 
[illegible] formula: 

[Structure not reproduced] 

[illegible] independently selected from the 
group consisting of hydrogen, methyl, hydroxy, alkoxy (e.g., 
methoxy, ethoxy, propoxy, allyloxy, and propargyloxy), 
alkylthio, halogen (Fluorine, Chlorine, and Bromine), 
cyano, and azido, and wherein Y is a heterocyclic moiety, 
[illegible] from the group consisting of purines, 
purine analogues, pyrimidines, pyrimidine analogues, universal 
bases (e.g., 5-nitroindole) or other groups or ring 
systems capable of forming one or more hydrogen bonds 
with corresponding moieties on alternate strands within a 
double- or triple-stranded nucleic acid or nucleic acid 
analogue, or other groups or ring systems capable of forming 
nearest-neighbor base-stacking interactions within a double- 
or triple-stranded complex. In other embodiments, the 
oligonucleotide analogues are not constructed from 
nucleosides, but are capable of binding to nucleic acids in 
solution due to structural similarities between the oligonucleotide 
analogue and a naturally occurring nucleic acid. 
An example of such an oligonucleotide analogue is a peptide 
nucleic acid or polyamide nucleic acid in which bases which 
hydrogen bond to a nucleic acid are attached to a polyamide 
backbone. 

The present invention also provides target nucleic acids 
[illegible] probes. Typically, the oligonucleotide probe arrays also 
comprise nucleotide analogues. 
[illegible] For instance, nucleotide 
[illegible] enzymatic copying of a nucleic acid [illegible]. Thus, a 
[illegible] to be analyzed is typically 
[illegible] PCR or RNA amplification procedure 
[illegible] 
analogue arrays and target nucleic acids 
are optionally composed of oligonucleotide analogues 
[illegible] hydrolysis or degradation by nuclease 
enzymes such as RNAase A. This has the advantage of 
[illegible] 
by rendering it resistant to enzymatic degradation. 
For example, analogues comprising 2'-O- 

methyloligoribonucleotides are resistant to RNAase A. 

Oligonucleotide analogue arrays are optionally arranged 
into libraries for screening [illegible] 
characteristics, such as the ability to bind a specified oligonucleotide 
analogue, or oligonucleotide analogue- 
containing structure. The libraries also [illegible] 
probes which 
[illegible] structures of interest. For instance, [illegible] 
array of oligonucleotide analogues optionally include a 
plurality of different members, each member having the 
formula: Y-L1-X1-L2-X2, wherein Y is a solid 
substrate, X1 and X2 are complementary oligonucleotides 
containing at least one nucleotide analogue, L1 is a spacer, 
and L2 is a linking group having sufficient length such that 
X1 and X2 form a double-stranded oligonucleotide. An array 
of such members comprise a library of unimolecular double- 
stranded oligonucleotide analogues. In another embodiment, 
the members of the array of oligonucleotide are arranged to 
present a moiety of interest within the oligonucleotide 
analogue probes of the array. For instance, the arrays are 
optionally conformationally restricted, having the formula 
-X11-Z-X12, wherein X11 and X12 are complementary 
oligonucleotides or oligonucleotide analogues and Z is a 
chemical structure comprising the binding site of interest. 

Oligonucleotide analogue arrays are synthesized on a 
solid substrate by a variety of methods, including light- 
directed chemical coupling, and selectively flowing synthetic 
reagents over portions of the solid substrate. The solid 
substrate is prepared for synthesis or attachment of oligonucleotides 
by treatment with suitable reagents. For 
example, glass is prepared by treatment with silane reagents. 

The present invention provides methods for determining 
whether a molecule of interest binds members of the oligonucleotide 
analogue array. For instance, in one embodiment, 
a target molecule is hybridized to the array and the resulting 
hybridization pattern is determined. The target molecule 
includes genomic DNA, cDNA, unspliced RNA, mRNA, 
and rRNA, nucleic acid analogues, proteins and chemical 
polymers. The target molecules are optionally amplified 
prior to being hybridized to the array, e.g., by PCR, LCR, or 
cloning methods. 

The oligonucleotide analogue members of the array used 
in the above methods are synthesized by any described 
method for creating arrays. In one embodiment, the oligonucleotide 
analogue members are attached to the solid 
substrate, or synthesized on the solid substrate by light- 
directed very large scale immobilized polymer synthesis, 
e.g., using photo-removable protecting groups during synthesis. 
In another embodiment, the oligonucleotide members 
are attached to the solid substrate by forming a plurality of 
channels adjacent to the surface of said substrate, placing 
selected monomers in said channels to synthesize oligonucleotide 
analogues at predetermined portions of selected 
regions, wherein the portion of the selected regions comprise 
oligonucleotide analogues different from oligonucleotide 
analogues in at least one other of the selected regions, 
and repeating the steps with the channels formed along a 
second portion of the selected regions. The solid substrate is 
any suitable material as described above, including beads, 
slides, and arrays, each of which is constructed from, e.g., 
silica, polymers and glass. 

DEFINITIONS 

[illegible] one or more monomeric 
[illegible] analogue [illegible] have some structural features 
in common with a naturally occurring oligonucleotide which 
allow it to hybridize with a naturally occurring oligonucleotide 
[illegible] structural groups are optionally 
added to the ribose or base of a nucleoside for incorporation 
into an oligonucleotide, such as a methyl or allyl 
group at the 2'-O position on the ribose, or a fluoro group 
which substitutes for the 2'-O group or a bromo group on 
the ribonucleoside base. The phosphodiester linkage, or 
"sugar-phosphate backbone" of the oligonucleotide analogue 
is substituted or modified, for instance with methyl 
phosphonates or O-methyl phosphates. Another example of 
an oligonucleotide analogue for purposes of this [illegible] 
includes "peptide nucleic acids" in which native or modified 
nucleic acid bases are attached to a polyamide backbone. 
Oligonucleotide analogues optionally comprise a mixture of 
naturally occurring nucleotides and nucleotide analogues. 
However, an oligonucleotide which is made entirely of 
naturally occurring nucleotides (i.e., those comprising DNA 
or RNA), with the exception of a protecting group on the end 
of the oligonucleotide, such as a protecting group used 
during standard nucleic acid synthesis is not considered an 
oligonucleotide analogue for purposes of this invention. 

A "nucleoside" is a pentose glycoside in which the 
aglycone is a heterocyclic base; upon the addition of a 
phosphate group the compound becomes a nucleotide. The 
[illegible] biological nucleosides are β-glycoside derivatives of 
D-ribose or D-2-deoxyribose. Nucleotides are phosphate 
esters of nucleosides which are generally acidic in solution 
due to the hydroxy groups on the phosphate. The nucleosides 
of DNA and RNA are connected together via phosphate 
units attached to the 3' position of one pentose and the 
5' position of the next pentose. Nucleotide analogues and/or 
nucleoside analogues are molecules with structural similarities 
to the naturally occurring nucleotides or nucleosides as 
discussed above in the context of oligonucleotide analogues. 

A "nucleic acid reagent" utilized in standard automated 
oligonucleotide synthesis typically carries a protected phosphate 
on the 3' hydroxyl of the ribose. Thus, nucleic acid 
reagents are referred to as nucleotides, nucleotide reagents, 
nucleoside reagents, nucleoside phosphates, nucleoside-3'- 
phosphates, nucleoside phosphoramidites, 
phosphoramidites, nucleoside phosphonates, phosphonates 
and the like. It is generally understood that nucleotide 
reagents carry a reactive, or activatible, phosphoryl or 
phosphonyl moiety in order to form a phosphodiester linkage. 

A "protecting group" as used herein, refers to any of the 
groups which are designed to block one reactive site in a 
molecule while a chemical reaction is carried out at another 
reactive site. More particularly, the protecting groups used 
herein are optionally any of those groups described in 
Greene, et al., Protective Groups In Organic Chemistry, 2nd 
Ed., John Wiley & Sons, New York, N.Y., 1991, which is 
incorporated herein by reference. The proper selection of 
protecting groups for a particular synthesis is governed by 
the overall methods employed in the synthesis. For example, 
in "light-directed" synthesis, discussed herein, the protecting 
groups are photolabile protecting groups such as NVOC, 
MeNPOC, and those disclosed in co-pending Application 
PCT/US93/10162 (filed Oct. 22, 1993), incorporated herein 
by reference. In other methods, protecting groups are [...] 

A "purine" is a generic term based upon the specific 

compound "purine" having a skeletal structure derived from 
the fusion of a pyrimidine ring and an imidazole ring. It is 
generally, and herein, used to describe a generic class of 
compounds which have an atom or a group of atoms added 
to the parent purine compound, such as the bases found in 
the naturally occurring nucleic acids adenine 
(6-aminopurine) and guanine (2-amino-6-oxopurine), or less 
commonly occurring molecules such as 2-amino-adenine, 
N6-methyladenine, or 2-methylguanine. 

A "purine analogue" has a heterocyclic ring with structural 
similarities to a purine, in which an atom or group of 
atoms is substituted for an atom in the purine ring. For 
instance, in one embodiment, one or more N atoms of the 
purine heterocyclic ring are replaced by C atoms. 

A "pyrimidine" is a compound with a specific heterocyclic 
diazine ring structure, but is used generically by persons 
of skill and herein to refer to any compound having a 
1,3-diazine ring with minor additions, such as the common 
nucleic acid bases cytosine, thymine, uracil, 
5-methylcytosine and 5-hydroxymethylcytosine, or the non- 
naturally occurring 5-bromo-uracil. 

A "pyrimidine analogue" is a compound with structural 
similarity to a pyrimidine, in which one or more atom in the 
pyrimidine ring is substituted. For instance, in one 
embodiment, one or more of the N atoms of the ring are 
substituted with C atoms. 

A "solid substrate" has fixed organizational support 
matrix, such as silica, polymeric materials, or glass. In some 
embodiments, at least one surface of the substrate is partially 
planar. In other embodiments it is desirable to physically 
separate regions of the substrate to delineate synthetic 
regions, for example with trenches, grooves, wells or the 
like. Example of solid substrates include slides, beads and 
arrays. 

DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows four panels (FIG. 1A, FIG. 1B, FIG. 1C and 
FIG. 1D). FIGS. 1A and 1B graphically display the difference 
in fluorescence intensity between the matched and 
mismatched DNA probes. FIGS. 1C and 1D illustrate the 
difference in fluorescence intensity versus location on an 
example chip for DNA and RNA targets, respectively. 

FIG. 2 is a graphic illustration of specific light-directed 
chemical coupling of oligonucleotide analogue monomers to 
[illegible] 

FIG. 3 shows the relative efficiency and specificity of 
hybridization for immobilized probe arrays containing 
[illegible] 2,6-diaminopurine (D) [illegible] 

FIG. 4 [illegible] substitutions 
in oligonucleotide arrays. (3'-ATGTT(G1G2G3G4G5) 
CGGGT-5' (SEQ ID NO:5)). 

DETAILED DESCRIPTION 

Methods of synthesizing desired single stranded oligo- 
[illegible] Gait, ed., IRL Press, Oxford (1984); 
[illegible] Research 18(17), 5197 
[illegible] Dueholm, J. Org. Chem. 59, 5767-5773 
(1994) and [illegible] Methods in Molecular Biology, 
[illegible] herein by reference 
for all purposes. Synthesizing unimolecular 
double-stranded DNA in solution has also been 
[illegible] application Ser. No. 08/327,687, 
now U.S. Pat. No. 5,556,752, which is incorporated herein 
[illegible] 

[illegible] methods of forming large arrays of 
[illegible] peptides and other polymer sequences 
[illegible] of synthetic steps are known. See, 
Pirrung et al., U.S. Pat. No. 5,143,854 (see also, PCT 
Application No. WO 90/15070) and Fodor et al., PCT 
Publication No. WO 92/10092, which are incorporated 
herein by reference, which disclose methods of forming vast 
arrays of peptides, oligonucleotides and other molecules 
using, for example, light-directed synthesis techniques. See 
also, Fodor et al., (1991) Science, 251, 767-77 which is 
incorporated herein by reference for all purposes. These 
procedures for synthesis of polymer arrays are now referred 
to as VLSIPS™ procedures. 

Using the VLSIPS™ approach, one heterogenous array of 
polymers is converted, through simultaneous coupling at a 
number of reaction sites, into a different heterogenous array. 
See, U.S. application Ser. No. 07/796,243 now U.S. Pat. No. 
5,384,261 and U.S. application Ser. No. 07/980,523 now 
U.S. Pat. No. 5,677,195, the disclosures of which are incorporated 
herein for all purposes. 

The development of VLSIPS™ technology as described 
in the above-noted U.S. Pat. No. 5,143,854 and PCT patent 
publication Nos. WO 90/15070 and 92/10092 is considered 
pioneering technology in the fields of combinatorial synthesis 
and screening of combinatorial libraries. More recently, 
patent application Ser. No. 08/082,937, filed Jun. 25, 1993 
(incorporated herein by reference), describes methods for 
making arrays of oligonucleotide probes that are used to 
check or determine a partial or complete sequence of a target 
nucleic acid and to detect the presence of a nucleic acid 
containing a specific oligonucleotide sequence. 

Combinatorial Synthesis of Oligonucleotide Arrays 

VLSIPS™ technology provides for the combinatorial 
synthesis of oligonucleotide arrays. The combinatorial 
VLSIPS™ strategy allows for the synthesis of arrays containing 
a large number of related probes using a minimal 
number of synthetic steps. [illegible] 
only [illegible] synthetic steps. 

In brief, [illegible] 
photolysis through a photolithographic 
mask is used selectively to expose functional groups 
which are then ready to react with incoming 
[illegible] are 
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blocking group). Thus, the pLphoramidites only add to ^ yl ^ 

those areas selectively exposed from the preceding step. These steps are repeated until the desired
array of [...] have been synthesized on the solid surface. Combinatorial synthesis of different
oligonucleotide analogues at different locations on the array is determined by the pattern of
illumination during synthesis and the order of addition of coupling reagents.

In the event that an oligonucleotide analogue with a polyamide backbone is used in the VLSIPS™
procedure, it is generally inappropriate to use phosphoramidite chemistry to perform the synthetic
steps, since the monomers do not attach to one another via a phosphate linkage. Instead, peptide
synthetic methods are substituted. See, e.g., Pirrung et al. U.S. Pat. No. 5,143,854.

Peptide nucleic acids are commercially available from, e.g., [...] Inc. (Bedford, Mass.) which
comprise a polyamide backbone and the bases found in naturally occurring nucleosides. Peptide nucleic
acids are capable of binding to nucleic acids with high specificity, and are considered [...] for
purposes of this disclosure. Note that peptide nucleic acids optionally comprise bases [...] which are
naturally occurring.

Hybridization of Nucleotide Analogues

The stability of duplexes formed between RNAs or DNAs are generally in the order of
RNA:RNA>RNA:DNA>DNA:DNA, in solution. Long probes have better duplex stability with a target, but
poorer mismatch discrimination than shorter probes (mismatch discrimination refers to the measured
hybridization signal ratio between a perfect match probe and a single base mismatch probe). Shorter
probes (e.g., 8-mers) discriminate mismatches very well, but the overall duplex stability is low.

In order to optimize mismatch discrimination and duplex stability, the present invention provides a
variety of nucleotide analogues incorporated into polymers and attached in an array to a solid
substrate.

Altering the thermal stability (Tm) of the duplex formed between the target and the probe using, e.g.,
known oligonucleotide analogues allows for optimization of duplex stability and mismatch
discrimination. One useful aspect of altering the Tm arises from the fact that Adenine-Thymine (A-T)
duplexes have a lower Tm than Guanine-Cytosine (G-C) duplexes, due in part to the fact that the A-T
duplexes have 2 hydrogen bonds per base-pair, while the G-C duplexes have 3 hydrogen bonds per base
pair. In heterogeneous oligonucleotide arrays in which there is a non-uniform distribution of bases, it
can be difficult to optimize hybridization conditions for all probes [...] in some embodiments, it is
desirable to destabilize G-C-rich duplexes and/or to increase the stability of A-T-rich duplexes while
maintaining the sequence specificity of hybridization. This is accomplished, e.g., by replacing one or
more of the native nucleotides in the probe (or the target) with certain modified, non-standard
nucleotides. Substitution of guanine residues with 7-deazaguanine, for example, will [...] destabilize
duplexes, whereas substituting adenine residues with 2,6-diaminopurine will enhance duplex stability. A
variety of other modified bases are also incorporated into nucleic acids to enhance or decrease overall
duplex stability while maintaining specificity of hybridization. The incorporation of 6-aza-pyrimidine
analogs into oligonucleotide probes generally decreases their binding affinity for complementary
nucleic acids. Many 5-substituted pyrimidines sub[...]

[...] of cytosine and uracil [...] nucleosides, nucleotides and various [...] nucleosides are
commercially available from a variety of manufacturers, including the SIGMA chemical company (Saint
Louis, Mo.), R&D [...] (Minneapolis, Minn.), Pharmacia LKB Biotechnology (Piscataway, N.J.), CLONTECH
Laboratories, Inc. [...], Chem Genes Corp., Aldrich Chemical Company (Milwaukee, Wis.), Glen Research,
Inc., GIBCO BRL [...] Technologies, Inc. (Gaithersberg, Md.), Fluka Chemica-Biochemika Analytika (Fluka
Chemie AG, Buchs, Switzerland), Invitrogen, San Diego, Calif., and Applied Biosystems (Foster City,
Calif.), as well as many other commercial sources known to one of skill. Methods of attaching bases to
sugar moieties to form nucleosides are known. See, e.g., Lukevics and Zablocka (1991), Nucleoside
Synthesis: Organosilicon Methods, Ellis Horwood Limited [...]. Methods of phosphorylating nucleosides
to form nucleotides, and of incorporating nucleotides into oligonucleotides are also known. See, e.g.,
Agrawal (ed) (1993) [...] Humana Press, Totowa, N.J., and the references therein. See [...] and Sanghvi
and Cook, and the [...] both supra.

[...] groups are [...] on the nucleoside [...] rings which [...] phosphate backbone, or through
hydrogen bonding [...] in the major and minor [...] cytosine nucleotides [...] the N[...] position with
an imida[...] stability. Universal [...] such as 3-nitropyrrole and 5-nitroindole are [...] duplex
stability through base stacking interactions.

[...] of oligonucleotide probes is also an [...] when optimizing hybridization [...] In general,
shorter probe sequences are more [...] of a single-[...] greater [...] effect on the hybrid duplex.
However, as the overall thermodynamic [...] of hybrids decreases with length, in some embodiments [...]
to enhance duplex stability for short [...] globally. Certain modifications of the sugar moiety in
[...] provide useful stabilization, and these can [...] of probes for complementary [...] acid
sequences. For example, 2'-O-methyl-, 2'-O-[...] 2'-O-allyl-oligoribonucleotides have higher [...]
sequences than [...] unmodified counterparts. Probes comprised of [...] oligoribonucleotides also form
more [...]

Replacement or substitution of the internucleotide phosphodiester linkage in oligo- or polynucleotides
is also used to [...] substituting phosphodiester linkages with [...] or phosphorodithioate linkages
[...] a non-ionic methylphospho[...] (racemic or, preferably, Rp stereochemistry) [...] Neutral or

cationic phosphoramidate linkages also [...] in enhanced
duplex stabilization. The phosphate diester backbone has [...] See, e.g., Crooke and Lebleu (eds)
(1993) Antisense Research and Applications CRC Press; and, Sanghvi and Cook (eds) (1994) Carbohydrate
modifications in Antisense Research ACS Symp. Ser. #580 ACS, Washington DC. Very stable hybrids are
formed between nucleic acids and probes comprised of peptide nucleic acids, in which the entire
sugar-phosphate backbone has been replaced with a polyamide structure.

Another important factor which sometimes affects the use of oligonucleotide probe arrays is the nature
of the target nucleic acid. Oligodeoxynucleotide probes can hybridize to DNA and RNA targets with
different affinity and specificity. For example, probe sequences containing long "runs" of consecutive
deoxyadenosine residues form less stable hybrids with complementary RNA sequences than with the
complementary DNA sequences. Substitution of dA in the probe with either 2,6-diaminopurine
deoxyriboside, or 2'-alkoxy- or 2'-fluoro-dA enhances hybridization with RNA targets.

Internal structure within nucleic acid probes or the targets also influences hybridization efficiency.
For example, GC-rich sequences, and sequences containing "runs" of consecutive G residues frequently
self-associate to form higher-order structures, and this can inhibit their binding to complementary
sequences. See, Zimmermann et al. (1975) J Mol Biol 92: 181; Kim (1991) Nature 351: 331; Sen and
Gilbert (1988) Nature 335: 364; and Sunquist and Klug (1989) Nature 342: 825. These structures are
selectively destabilized by the substitution of one or more guanine residues with one or more of the
following purines or purine analogs: 7-deazaguanine, 8-aza-7-deazaguanine, 2-aminopurine, 1H-purine,
and hypoxanthine, in order to enhance hybridization.

Modified nucleic acids and nucleic acid analogs can also be used to improve the chemical stability of
probe arrays. For example, certain processes and conditions that are useful for either the fabrication
or subsequent use of the arrays, may not be compatible with standard oligonucleotide chemistry, and
alternate chemistry can be employed to overcome these problems. For example, exposure to acidic
conditions will cause depurination of purine nucleotides, ultimately resulting in chain cleavage and
overall degradation of the probe array. In this case, adenine and guanine are replaced with
7-deazaadenine and 7-deazaguanine, respectively, in order to stabilize the oligonucleotide probes
towards acidic conditions which are used during the manufacture or use of the arrays.

Base, phosphate and sugar modifications are used in combination to make highly modified
oligonucleotide analogues which take advantage of the properties of each of the various modifications.
For example, oligonucleotides which have higher binding affinities for [...] than their unmodified
counterparts (e.g., 2'-O-methyl-, 2'-O-propyl-, and 2'-O-allyl oligonucleotides) can be incorporated
into oligonucleotides with modified bases (deazaguanine, 8-aza-7-deazaguanine, 2-aminopurine,
1H-purine, hypoxanthine and the like) with non-ionic methylphosphonate linkages or neutral or cationic
phosphoramidate linkages, resulting in [...] formation between the oligonucleotide and a target nucleic
acid. For instance, one preferred oligonucleotide [...] a 2'-O-methyl-2,6-diaminopurineriboside [...]

[...] equilibrium studies, kinetic "on-rate" studies, and sequence specificity analysis is optionally
performed for any target oligonucleotide and probe or probe analogue. The data obtained shows the
behavior of the analogues upon duplex formation with target oligonucleotides. Altered duplex stability
conferred by using oligonucleotide analogue probes are ascertained by following, e.g., fluorescence
signal intensity of oligonucleotide analogue arrays hybridized with a target oligonucleotide over time.
The data allow optimization of specific hybridization conditions at, e.g., room temperature (for
simplified diagnostic applications).

Another way of verifying altered duplex stability is by following the signal intensity generated upon
hybridization with time. Previous experiments using DNA targets and DNA chips have shown that signal
intensity increases with time, and that the more stable duplexes generate higher signal intensities
faster than less stable duplexes. The signals reach a plateau or "saturate" after a certain amount of
time due to all of the binding sites becoming occupied. These data allow for optimization of
hybridization, and determination of [...] conditions at a specified temperature.

Graphs of signal intensity and base mismatch positions are plotted and the ratios of perfect match
versus mismatches calculated. This calculation shows the sequence properties of nucleotide analogues as
probes. Perfect match/mismatch ratios greater than 4 are often desirable in an oligonucleotide
diagnostic assay because, for a diploid genome, ratios of 2 have to be distinguished (e.g., in the case
of a heterozygous trait or sequence).

Target Nucleic Acids Which Comprise Nucleotide Analogues

Modified nucleotides and nucleotide analogues are incorporated synthetically or enzymatically into DNA
or RNA target nucleic acids for hybridization analysis to oligonucleotide arrays. The incorporation of
nucleotide analogues in the target optimizes the hybridization of the target in terms of sequence
specificity and/or the overall affinity of binding to oligonucleotide and oligonucleotide analogue
probe arrays. The use of nucleotide analogues in either the oligonucleotide array or the target nucleic
acid, or both, improves optimizability of hybridization interactions. Examples of useful nucleotide
analogues which are substituted for naturally occurring nucleotides include 7-deazaguanosine,
2,6-diaminopurine nucleotides, 5-propynyl and other 5-substituted pyrimidine nucleotides, 2'-fluoro and
2'-methoxy-2'-deoxynucleotides and the like.

These nucleotide analogues are incorporated into nucleic acids using the synthetic methods described
supra, or using DNA or RNA polymerases. The nucleotide analogues are preferably incorporated into
target nucleic acids using in vitro amplification methods such as PCR, LCR, [...] expansion, in vitro
transcription (e.g., nick translation or random-primer transcription) and the like. Alternatively, the
nucleotide analogues are optionally incorporated into cloned nucleic acids by culturing a cell which
comprises the cloned nucleic acid in media which includes a nucleotide analogue.

[...] is used in target nucleic acids to substitute for G/dG to enhance target hybridization by
reducing secondary structure in sequences containing runs of poly-G/dG. 2,6-diaminopurine nucleotides
substitute for A/dA to enhance target hybridization through enhanced H-bonding to T or U
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nucleotides substitute fornatur, pyrites to enhance ^^S^^" 

target hybridization to certain purine rich p obes. 2 -fluro mdu* .jm y * the ^ aUachin 

and 2'-methoxy-2^eoxynucleot.des substitute for natural Ucuiany o]i leoMt ana logue is either bis(2- 

nucleotides to enhance target hybridization to similarly fc ydroxyethy |): am i n0 propyltriethoxysilane, n-(3- 

substituted probe sequences. 5 u j et hoxysilylpropy l)-4-hydroxybutylamide, 

Synthesis of 5'-photoprotected 2'-0 alkyl ribonucleotide . am ; nopr0 p y itrietboxysilane or hydroxypropyltriethoxysi- 

analogues lane. ' . 

The light-directed synthesis of complex arrays of nucle- . oligoribonucleotides generated by synthesis using 

otide analogues on a glass surface is achieved by derivatiz- ord inary ribonucleotides are usually base labile due to the 

ine cvanoethyl phosphoramidite nucleotides and nucleotide presence of the 2'-hydroxyl group 2-0- 

nucleoside analogues of uridine, thymidine, nUyloligoribonucleotides (2'-OMeORNs) ^ °.^naL 

group. See, application SN PCT/US94/12305 ^ ^ Q ^ Universily PresS| 1991> pp . 4 9-86, incor- 
Specific base-protected 2'-0 alkyl nucleosides are com- ^ fey refc|ence for M pur poses, have reported 
merciaUy available, from, e.g., Chem Genes Corp. (MA;, jj^ synt hesis of mixed sequences of 2'-0-Methoxy- 
The photolabile MeNPoc group is added to the 5 -hydroxy O i: oor ibonudeotides (2'-0-MeORNs) using dimethoxytrityl 
position followed by phosphitylation to yield cyanoethyl M ^ horamidite cbem istry. These 2'-0-MeORNs display 
phosphoramidite monomers. Commercially available r ^ bmding affinity for complementary nucleic acids 
nucleosides are optionally modified (e.g., by 2-O-alkyIation) ^ unmodined counterparts, 
to create nucleoside analogues which are used to generate q ^ embodiments of me iavenUon provide mechanical 
oligonucleotide analogues. means (o nerate oligonucleotide analogues. These tech- 
Modifications to the above procedures are used in some u ^ ^ djscussed m co-pending application Ser. No. 
embodiments to avoid significant addition of MenPoc to the 0 7/796,243, filed Nov. 22, 1991, which is incorporated 
3'-hydroxyl position. For instance, in one embodiment a by ' reference ^ entirety for all purposes. 
r-O-methyl ribonucleotide analogue is reacted with DMT- ^^uy oligonucleotide analogue reagents are directed 
CI {di(p-methoxyphenyl)phenylchloride} in the presence of ^ ^ Qf a subs[rate ^ that a predefined array 
pyridine to generate a 2^0-methyl-5 , -0-DMT^nucle- x ^ oli cleotide analogues is created. For instance, a 
otide analogue. This allows for the addition of TBDMS to rf channelSj grooves, or spots are formed on or 
the 3'-0 of the ribonucleoside analogue by reaction with • to a ^u bstrale . Reagents are selectively flowed 
TBDMS-Triflate (t-butyldimethylsilyltrifluoromethane- ^ or deposited m thc chann els, grooves, or spots, 
sulfonate) in the presence of tnethylamine in 1HI- formjDg an array having different oligonucleotides and/or 
(tetrahydrofuran) to yield a 2'-0-memyl-3^0-TBDMS-5 -O- 35 cleotide ana i og ues at selected locations on the sub- 
DMT ribonucleotide base analogue. This analogue is treated 

with TCAA (trichloroacetic acid) to cleave off the DMT Detection of Hybridization 

group, leaving a reactive hydroxyl group . at die 5 posihon embodiment, hybridization is detected by labeling 

MeNPoc is then added to the oxygen of the 5 hydroxyl « fluorescein or other known visualization 

group using MenPoc-Cl in the presence of B^-T* 40 ^ andin * cub V tmg me target with an array of pUgonucle- 

TBDMS group is then cleaved with F" (e.g NaF) to yield g ex formation fey me , 

a ribonucleotide base analogue with a MeNPoc group ouoe P P ft h^on in embodi- 

attached to the 5' oxygen on the nucleotide analogue. If wuh a probe in^u. y^ ^ ^ doubk . 

appropriate, this analogue is phosphitylated to yield a phos- menu jh «M * ^ fa ^ c g m 

phoramidite for oUgonucleotide « a ^ 0 ,J r and detected by viewing the array, e.g., through 

nucleosides or nucleoside analogues are protected by similar i ^ a ^ g mfcn 4 eope . 

procedures. Seauencine by hybridization 
Synthesis of Oligonucleotide Analogue Arrays on Chips JJ ac L g methodologies are highly reliant on 
Other than the use of photoremovable protecting groups, ^ dures and require ^anHal manual effort, 
the nucleoside coupling chemistry used in VLSIK> le ™" 50 conventional DNA sequencing technology is a laborious 
nology for synthesizing oligonucleotides and ohgooucie- „ ocedure ™ uiting electrophoretic size separation of. 
otide analogues on chips is similar to that used tor ougo- r ^ DNAfra _ ents ^ alternative approach involves a 
nucleotide synthesis. The oUgonucleotide is typically linked MjxMion strategy carr ied out by attaching target DNA to 
to the substrate via the S'-hydroxyl group of the ohgonucte- J fe mt ated ^ , 0 f oligonucle- 
otide and a functional group on the substrate which resulte 55 ■ ^ ^ ^ ^ mUcition SN PCT/US94/ 
in the formation of an ether, ester, carbamate or phosphate » »j 

ester linkage. Nucleotide or oligonucleotide analogues are oreferre d met hod of oUgonucleotide probe array syn- 

attached to the solid support via carbon^rtwn bonds using, V fa f y^, to direct synthesis of 

for example, supports having ^y^^^Z „ SiSSS !?JoS« probe, in high^ensity, miniaOu- 

surfaces, or preferably, by siloxane bonds (using, for 60 Matrices of spatiaUy-defined oUgonucleotide 

example, glass or silicon oxide as the so id support). Silox- «J arrays we7e generated. Tne abiUty to use 

ane bonds with the surface of the support are formed in one an^ogue ^ pro y » s was dem . 

embodiment via reactions of surface atuching portions J^SCfiXd&ng fluorescent labeled oligonucle- 

bearing trichlorosUyl or trialtoxysilyl groups. The surface ^^.J .JKe. produced, 

attaching groups have a site for attachment of the ohgo- 65 oUdes^ tc , me m p ^ 

nucleotideanalogueportionForexamplegroupswbchare ^^^^^^o,^. 

suitable for atuchment include amines, hydroxyl, thiol, and sequence specinc nyonu 
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ol.gpnucleot.dc analogue complexes. ^ ^ 0 ,; leotide analogue . 

Oligonucleotide analogue Probe Arrays and UWs «yp? ^ ^ preparation 

The use of oligonucleoude analogues in probe arra>s ^ rf oligODUcleot ; d e analogues having bulges or 

provides several benefits as compared to standard obgo- to , ementarv regioils; Specific RNA 

nucleotide arrays. For instance, as discussed supra certain i» ^ often nized by proteins (e g., TAR RNA is 

oligonucleotide analogues have enhanced hybndizauon jq * by me XAX prote5n 0 f HIV). Accordingly. tibrar- 

characteristics to complementary nucleic acids as compared ^ oligonucleotide analogue bulges or loops are useful in 

with oligonucleotides made of naturally otxurrrog uuete- ^ numb „ of diagnostic applications. The bulge or loop can 

otides. One primary benefit of enhanced hybridization cnar- oligonucleotide analogue or linker portions, 

acteristics is that oligonucleotide analogue probes are °c p 6 ^ can te configured in , 

optionally shorter than corresponding probes which do not ^ J^^T^ ffiiment, the unimolecular 

include nucleotide analogues. - orobe s comprise linkers, for example, where the probe is 

Standard oligonucleotide probe arrays typicaUy require £ Y-V-X-V-X*. in 

fairly long probes (about 15-25 nucleotides) to achieve arranged km g ^ ^ ^ an(j ^ a 

strong binding to target nucleic acids. The use complementary oligonucleotides or oligonucleotide 

probes is disadvantageous for two reasons First, the longer M P £ f resents a bo nd or a space r, and L 2 repre- 

the probe, the more synthetic steps must be performed to * ^- * fa suffident length suc h that X 1 

make the probe and any probe array comprising the probe. ^ ^ double-stranded oUgonucleotide. The general 

This increases the cost of making the _ P«^«™ synlhetic ^ conformational strategy used in generating the 

Furthermore, as each synthetic step results in less than 100% do ; ble . stranded un imolecular probes is simflar to that 

coupling for every nucleotide, the quality of the probes M aoum ^-nding application Ser. No. 08/327 687, 
degrades as they become longer. Secondly ^fthat any ofthe elements of the probe (L\ X'. L and 

provide better mis-match discrimination for hybridization to comorises a nucleotide or an oUgonucleotide analogue, 

a target nucleic acid. This is because a single base mismatch ^ embodiment X 1 is an oligonucleotide 

for a short probe-target hybridization is less destabilizing ™^™° e ' * 

than a single mismatch for a long probe-target hybndizauon. M anaiogu . optionally 

nation m probe arrays. ..... 35 oenerallv have the structure— X 1 — Z— X 2 wherein X" and 

The enhanced hybridization charactenstics of ohgonuc e- | e ^^ m ™^ oligonucleotide analogues and Z is 

otide analogues also allows for the creation o obgonucle- X ^XCent presented away from the surface of the 

otide analogue probe arrays where the ^probev mthe an ays jsuu P ^ ^ fof a ^ 

have substantial secondary structure. For instance, the oh- ^ * p , or a toxin> ven0 m, viral epitope, hormone, 
^^^t^TX " SS»W*» P-in, anti*dy or the 

^•esuaM » ?■ -rt for detec,ion of 8 po,ymorphism 

self-complementary regions. Libraries of diverse double- in a target oligonucleotide 

stranded oligonucleotide analogue probes are used, for 4J In diagnostic applications, oligonucleotide analogue 

example, in screening studies to determine binding affinity arrays ( e . g ., arrays on chips, slides or beads) are used to 

of nucleic acid binding proteins, drugs, or oligonucleotides determine whether there are any differences between a 

Cee to examine triple helix formation). Specific oligonucle- re f efe ace sequence and a target ohgonucleotide, e.g., 

otide' analogues are known to be conducive to the formation whet her an individual has a mutation or polymorphism in a 
of unusual secondary structure. Sec, Durland (1995) Bio- J0 known ge ne. As discussed supra, the ohgonucleoude target 

coniueate Chan 6: 278-282. General strategies for usiog ^ optionally a nucleic acid such as a PCR amphcon wnicn 

unimolecular double-stranded oligonucleotides as probes comprises one or more nucleotide analogues. In one 

and for library generation is described in application Ser. No embodiment, arrays are designed to contain probes exmb- 

08/327,687, and similar strategies are applicable to oligo- iting complementarity to one or more >^J*™* 
nucleotide analogue probes. 55 sequence whose sequence is ^ B ^.^™ ££J 

in oeneral a solid support, which optionally has an read a target sequence compnsuig either the reference 

J^ ^ MkLTScbcd to the distal end of the sequence itself or variants of that sequence Any polynucle- 

££k*fe mtaro Probe T? c probe is attached as a otide of known sequence is selected as a reference sequence. 

sa^^^i ^ r ges is 

i- .._i m »m* on^inmip arravs are fully double-stranded, patients. For example, the CFJR gene ana rzogene w 

K^; m ,„.. molecules other than serve to identify pathogenic microorganisms and/or are the 

site of mutations by which such microorganisms acquire drug resistance (e.g., the HIV reverse
transcriptase gene for HIV [...]). Other reference sequences of interest include regions where
polymorphic variations are known to occur (e.g., the D-loop region of mitochondrial DNA). These
reference sequences also have utility for, e.g., forensic, cladistic, or epidemiological studies.

Other reference sequences of interest include those from the genome of pathogenic viruses (e.g.,
hepatitis (A, B, or C), herpes virus (e.g., VZV, HSV-1, HAV-6, HSV-II, CMV, and Epstein Barr virus),
adenovirus, influenza virus, flaviviruses, echovirus, rhinovirus, coxsackie virus, coronavirus,
respiratory syncytial virus, mumps virus, rotavirus, measles virus, rubella virus, parvovirus, vaccinia
virus, HTLV virus, dengue virus, papillomavirus, molluscum virus, poliovirus, rabies virus, JC virus
and arboviral encephalitis virus). Other reference sequences of interest are from genomes or episomes
of pathogenic bacteria, particularly regions that confer drug resistance or allow phylogenic
characterization of the host (e.g., 16S rRNA or corresponding DNA). For example, such bacteria include
chlamydia, rickettsial bacteria, mycobacteria, staphylococci, streptococci, pneumococci, meningococci
and gonococci, klebsiella, proteus, serratia, pseudomonas, legionella, diphtheria, salmonella, bacilli,
cholera, tetanus, botulism, anthrax, plague, leptospirosis, and Lyme disease bacteria. Other reference
sequences of interest include those in which mutations result in the following autosomal recessive
disorders: sickle cell anemia, β-thalassemia, phenylketonuria, galactosemia, Wilson's disease,
hemochromatosis, severe combined immunodeficiency, alpha-1-antitrypsin deficiency, albinism,
alkaptonuria, lysosomal storage diseases and Ehlers-Danlos syndrome. Other reference sequences of
interest include those in which mutations result in X-linked recessive disorders: hemophilia,
glucose-6-phosphate dehydrogenase, agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome,
muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease and fragile X-syndrome. Other reference
sequences of interest include those in which mutations result in the following autosomal dominant
disorders: familial hypercholesterolemia, polycystic kidney disease, Huntington's disease, hereditary
spherocytosis, Marfan's syndrome, von Willebrand's disease, neurofibromatosis, tuberous sclerosis,
hereditary hemorrhagic telangiectasia, familial colonic polyposis, Ehlers-Danlos syndrome, myotonic
dystrophy, muscular dystrophy, osteogenesis imperfecta, acute intermittent porphyria, and von
Hippel-Lindau disease.

Although an array of oligonucleotide analogue probes is usually laid down in rows and columns for
simplified data processing, such a physical arrangement of probes on the solid substrate is not
essential. Provided that the spatial location of each probe in an array [...] the probes is collected
and processed [...] of a target irrespective of the [...] probes on, e.g., a chip. In processing the
data, the hybridization signals from the respective probes is [...] any conceptual array desired for
subsequent data reduction, whatever the physical arrangement of probes on the substrate.

In one embodiment, a basic tiling strategy provides an array of immobilized probes [...]
oligonucleotide showing a high degree of sequence [...] one or more selected reference oligonucleotide
(e.g., detection of a point mutation in a target sequence). For instance, a first probe set comprises a
plurality of probes exhibiting [...] throughout the length of the probe. However, probes having a
segment or segments of perfect complementarity that is/are flanked by leading or trailing sequences
lacking complementarity to the reference sequence can also be used. Within a segment of
complementarity, each probe in the first probe set has at least one interrogation position that
corresponds to a nucleotide in the reference sequence. The interrogation position is aligned with the
corresponding nucleotide in the reference sequence when the probe and reference sequence are aligned to
maximize complementarity between the two. If a probe has more than one interrogation position, each
corresponds with a respective nucleotide in the reference sequence. The identity of an interrogation
position and corresponding nucleotide in a particular probe in the first probe set cannot be determined
simply by inspection of the probe in the first set. An interrogation position and corresponding
nucleotide is defined by the comparative structures of probes in the first probe set and corresponding
probes from additional probe sets.

For each probe in the first set, there are, for purposes of the present illustration, multiple
corresponding probes from additional probe sets. For instance, there are optionally probes
corresponding to each nucleotide of interest in the reference sequence. Each of the corresponding
probes has an interrogation position aligned with that nucleotide of interest. Usually, the probes from
the additional probe sets are identical to the corresponding probe from the first probe set with one
exception. The exception is the interrogation position, which occurs in the same position in each of
the corresponding probes from the additional probe sets. This position is occupied by a different
nucleotide in the corresponding probe sets. Other tiling strategies are also employed, depending on the
information to be obtained.

The probes are oligonucleotide analogues which are capable of hybridizing with a target nucleic
sequence by complementary base-pairing. Complementary base pairing means sequence-specific base
pairing, which comprises, e.g., Watson-Crick base pairing or other forms of base pairing such as
Hoogsteen base pairing. The probes are attached by appropriate linkage to a support. 3' attachment is
more usual as this orientation is compatible with the preferred chemistry used in solid phase synthesis
of oligonucleotides and oligonucleotide analogues (with the exception of, e.g., analogues which do not
have a phosphate backbone, such as peptide nucleic acids).

EXAMPLES

The following examples are provided by way of illustration only and not by way of limitation. A variety
of parameters can be changed or modified to yield essentially similar results.

One approach to enhancing oligonucleotide hybridization is to increase the thermal stability (Tm) of
the duplex formed between the target and the probe using oligonucleotide [...] on hybridization to DNA.
Enhanced hybridization using oligonucleotide [...] in oligonucleotide arrays.

Example 1

Solution oligonucleotide melting Tm

[...] 2'-O-methyl oligonucleotide analogues was [...] compared to the Tm of [...] 2'-O-methyl [...] in
solution was also determined.
The Tm was determined by varying the sample temperature
and monitoring the absorbance of the sample solution at 260
nm. The oligonucleotide samples were dissolved in a 0.1M
NaCl solution with an oligonucleotide concentration of 2
µM. Table 1 summarizes the results of the experiment. The
results show that the hybridization of DNA in solution has
approximately the same Tm as the hybridization of DNA
with a 2'-O-methyl-substituted oligonucleotide analogue.
The results also show that the Tm for the 2'-O-methyl
[...] corresponding [...] oligonucleotide
duplex, which is higher than the Tm for the corresponding
DNA:DNA or RNA:DNA duplex.



TABLE 1

Solution Oligonucleotide Melting Experiments

(+) = Target Sequence
(5'-CTGAACGGTAGCATCTTGAC-3') (SEQ ID NO: 6)*

(-) = Complementary Sequence
(5'-GTCAAGATGCTACCGTTCAG-3') (SEQ ID NO: 7)*



rate of increase in intensity was then plotted for each probe 
position. The rate of increase in intensity was similar for 
both targets in the 8-mer probe arrays, but the 12-mer probes 
hybridized more rapidly to the DNA target oligonucleotide. 

Plots of intensity versus probe position were generated for
the RNA, DNA and 2'-O-methyl oligonucleotides to ascertain
mismatch discrimination. The 8-mer probes displayed
similar mismatch discrimination against all targets. The

" - 1- - - *i — *i — t th* ^nrrp. nation. 

Thermal equilibrium experiments were performed by
hybridizing each of the targets to the chip for 90 minutes at
5° C. temperature intervals. The chip was hybridized with
the target in 5x SSPE at a target concentration of 10 nM.
Intensity measurements were taken at the end of the 90
minute hybridization at each temperature point as described
above. All of the targets displayed similar stability, with
minimal hybridization to the 8-mer probes at 30° C. In
addition, all of the targets showed similar stability in
hybridizing to the 12-mer probes. Thus, the 2'-O-methyl
oligonucleotide target had similar hybridization
characteristics to DNA and RNA targets when hybridized
against DNA probes.

Example 3 

2'-O-methyl-substituted oligonucleotide chips

DMT-protected DNA and 2'-O-methyl phosphoramidites
were used to synthesize 8-mer probe arrays on a glass slide
using the VLSIPS™ method. The resulting chip was
hybridized to DNA and RNA targets in separate experiments.
The target sequence, the sequences of the probes on the chip
and the general physical layout of the chip are described in
Table 3.

The chip was hybridized to the RNA and DNA targets in
successive experiments. The hybridization conditions used
were 10 nM target, in 5x SSPE. The chip and solution were
heated from 20° C. to 50° C., with a fluorescence
measurement taken at 5 degree intervals as described in SN
PCT/US94/12305. The chip and solution were maintained at
each temperature for 90 minutes prior to fluorescence
measurements. The results of the experiment showed that DNA




Target (+)    Complement (-)    Tm (° C.)
DNA(+)        DNA(-)            61.6
DNA(+)        2'OMe(-)          58.6
2'OMe(+)      DNA(-)            61.6
2'OMe(+)      2'OMe(-)          78.0
RNA(+)        DNA(-)            58.2
RNA(+)        2'OMe(-)          73.6

*T refers to thymine for the DNA oligonucleotides, or uracil for the RNA
oligonucleotides.
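
The six melting temperatures listed for Table 1 can be compared against the DNA:DNA duplex with a few lines of code. The sketch below is illustrative only: the dictionary layout and the function names `delta_tm` and `most_stable` are assumptions introduced here, the values are transcribed from the table, and the unit is assumed to be degrees Celsius.

```python
# Melting temperatures transcribed from Table 1 (unit assumed to be deg C).
# Keys are (target strand, complementary strand).
TM = {
    ("DNA", "DNA"): 61.6,
    ("DNA", "2'OMe"): 58.6,
    ("2'OMe", "DNA"): 61.6,
    ("2'OMe", "2'OMe"): 78.0,
    ("RNA", "DNA"): 58.2,
    ("RNA", "2'OMe"): 73.6,
}


def delta_tm(reference=("DNA", "DNA")):
    """Return {duplex: Tm - Tm(reference)}, rounded to one decimal place."""
    ref = TM[reference]
    return {pair: round(tm - ref, 1) for pair, tm in TM.items()}


def most_stable():
    """Return the duplex with the highest melting temperature."""
    return max(TM, key=TM.get)


if __name__ == "__main__":
    ranked = sorted(delta_tm().items(), key=lambda kv: kv[1], reverse=True)
    for (target, comp), diff in ranked:
        tm = TM[(target, comp)]
        print(f"{target}(+):{comp}(-)  Tm={tm:.1f}  dTm={diff:+.1f}")
```

Ranking by the difference makes the pattern in the table easy to see: the duplexes with a 2'-O-methyl complementary strand paired to a 2'-O-methyl or RNA target sit well above the DNA:DNA reference.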

Example 2

Array hybridization experiments with DNA chips and
oligonucleotide analogue targets

A variable length DNA probe array on a chip was
designed to discriminate single base mismatches in the 3
corresponding sequences
5'-CTGAACGGTAGCATCTTGAC-3' (SEQ ID NO:6)
(DNA target), 5'-CUGAACGGUAGCAUCUUGAC-3'
(SEQ ID NO:8) (RNA target) and
5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO:9)
(2'-O-methyl oligonucleotide target), and generated by the
VLSIPS™ procedure. The chip was designed with adjacent
12-mers and 8-mers which overlapped with the 3 target
sequences as shown in Table 2.

[...] 2'-O-methyl oligonucleotide [...] analogue
oligonucleotide probes



TABLE 2

Array hybridization Experiments

Target 1 (DNA)               5'-CTGAACGGTAGCATCTTGAC-3' (SEQ ID NO: 6)
8-mer probe (complement)
12-mer probe (complement)
Target 2 (RNA)               5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO: 8)
8-mer probe (complement)
12-mer probe (complement)
Target 3 (2'-O-Me oligo)     5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO: 9)
8-mer probe (complement)
12-mer probe (complement)



Target oligos were synthesized using standard techniques.
The DNA and 2'-O-methyl oligonucleotide analogue target
oligonucleotides were hybridized to the chip at a
concentration of 10 nM in 5x SSPE at 20° C. in sequential
experiments. Intensity measurements were taken at each
probe position in the 8-mer and 12-mer arrays over time. The



showed dramatically better hybridization to the RNA target
than the DNA probes. In addition, the 2'-O-methyl analogue
oligonucleotide probes showed superior mismatch
discrimination of the RNA target compared to the DNA probes.
The difference in fluorescence intensity between the matched
and mismatched analogue probes was greater than the difference

oligodeoxynucleotide arrays were constructed using
VLSIPS™ methodology and 5'-O-MeNPOC-protected
deoxynucleoside phosphoramidites. Each array was comprised
of the following set of probes based on the sequence
(3')-CATCGTAGAA-(5') (SEQ ID NO:1):

1. (HEG)-(3')-CATN1GTAGAA-(5') (SEQ ID NO:14)
2. (HEG)-(3')-CATCN2TAGAA-(5') (SEQ ID NO:15)
3. (HEG)-(3')-CATCGN3AGAA-(5') (SEQ ID NO:16)
4. (HEG)-(3')-CATCGTN4GAA-(5') (SEQ ID NO:17)

where HEG = hexaethyleneglycol linker, and N is either
A, G, C or T so that probes are obtained which contain single
mismatches introduced at each of four central locations in
the sequence. The first probe array was constructed with all
natural bases. In the second array, 2-amino-2'-
deoxyadenosine (D) was used in place of adenosine (A).
Both arrays were hybridized with a 5'-fluorescein-labeled
oligodeoxynucleotide target, (5')-Fl-d
(CTGAACGGTAGCATCTTGAC)-(3') (SEQ ID NO:18),
which contained a sequence (in bold) complementary to the
base probe sequence. The hybridization conditions were: 10
nM target in 5x SSPE buffer at 22° C. with agitation. After
30 minutes, the chip was mounted on the flowcell of a
scanning laser confocal fluorescence microscope, rinsed
briefly with 5x SSPE buffer at 22° C., and then a surface
fluorescence image was obtained.
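
The probe design described here (one base probe with N = A, G, C or T at each of four central positions) can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names, the 0-based index tuple, and the simple Watson-Crick lookup are assumptions introduced here, and the analogue base D is not modelled.

```python
# Base probe written 3'->5' as in the text: (3')-CATCGTAGAA-(5').
BASE_PROBE_3TO5 = "CATCGTAGAA"
# Fluorescein-labeled target written 5'->3' (SEQ ID NO:18).
TARGET_5TO3 = "CTGAACGGTAGCATCTTGAC"
# 0-based indices of N1..N4 in the 3'->5' probe string (assumed from the
# probe formulas CATN1GTAGAA, CATCN2TAGAA, CATCGN3AGAA, CATCGTN4GAA).
MISMATCH_POSITIONS = (3, 4, 5, 6)
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def probe_set(base=BASE_PROBE_3TO5, positions=MISMATCH_POSITIONS):
    """Return {position: {N: probe}} with N substituted by A, G, C and T."""
    return {
        p: {n: base[:p] + n + base[p + 1:] for n in "AGCT"}
        for p in positions
    }


def binding_site(probe_3to5):
    """Target stretch (5'->3') that pairs fully with an antiparallel probe.

    Because the probe is written 3'->5', pairing is position by position
    and no reversal is needed.
    """
    return "".join(PAIR[b] for b in probe_3to5)


def is_perfect_match(probe_3to5, target=TARGET_5TO3):
    """True if the target contains the probe's full complementary site."""
    return binding_site(probe_3to5) in target


if __name__ == "__main__":
    for pos, variants in probe_set().items():
        for n, probe in variants.items():
            tag = "P" if is_perfect_match(probe) else "M"
            print(pos, n, probe, tag)
```

Each position yields one perfectly matched probe (the base sequence itself) and three single-base mismatches, which is the comparison the hybridization experiment relies on.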

The relative efficiency of hybridization of the target to the
complementary and single-base mismatched probes was
determined by comparing the average bound surface
fluorescence intensity in those regions of the array
containing the individual probe sequences. The results (FIG.
3) show that a 2-amino-2'-deoxyadenosine (D) substitution
in a heterogeneous probe sequence is a relatively neutral
one, with little effect on either the signal intensity or the
specificity of DNA-DNA hybridization, under conditions
where the target is in excess and the probes are saturated.

Example 4

Synthesis of oligonucleotide analogues

The reagent MeNPoc-Cl group reacts non-selectively
with both the 5' and 3' hydroxyls on 2'-O-methyl nucleoside
analogues. Thus, to generate high yields of 5'-O-MeNPoc-
2'-O-methylribonucleoside analogues for use in oligonucle-



between the matched and mismatched DNA probes,
dramatically increasing the signal-to-noise ratio. FIG. 1
displays the results graphically (FIGS. 1A and 1B). (M) and (P)
indicate mismatched and perfectly matched probes,
respectively. (FIGS. 1C and 1D) illustrate the fluorescence
intensity versus location on an example chip for the various
probes at 20° C. using DNA and RNA targets, respectively.

TABLE 3

5'-CTGAACGGTAGCATCTTGAC-3' (SEQ ID NO: 6)
5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO: 8)

Matching DNA oligonucleotide probe
    3'-CTTGCCAT (SEQ ID NO: 10)
Matching 2'-O-methyl oligonucleotide analogue probe
    3'-CUUGCCAU (SEQ ID NO: 11)
DNA oligonucleotide probe with 1 base mismatch
    3'-CTTGCTAT (SEQ ID NO: 12)
2'-O-methyl oligonucleotide analogue probe with 1 base
mismatch {2'OMe (M)}
    3'-CUUGCUAU (SEQ ID NO: 13)

Matching 2'-O-methyl oligonucleotide analogue probe
2'-O-methyl oligonucleotide analogue probe with 1 base mismatch
DNA oligonucleotide probe with 1 base mismatch
Matching DNA oligonucleotide probe



otide analogue synthesis, the following protection-
deprotection scheme was utilized.

The protective group DMT was added to the 5'-O position
of the 2'-O-methylribonucleoside analogue in the presence
of pyridine. The resulting 5'-O-DMT protected analogue
was reacted with TBDMS-Triflate in THF, resulting in the
addition of the TBDMS group to the 3'-O of the analogue.
The 5'-DMT group was then removed with TCAA to yield
a free OH group at the 5' position of the 2'-O-methyl
ribonucleoside analogue, followed by the addition of
MeNPoc-Cl in the presence of pyridine, to yield 5'-O-
MeNPoc-3'-O-TBDMS-2'-O-methyl ribonucleoside
analogue. The TBDMS group was then removed by reaction
with NaF, and the 3'-OH group was phosphitylated using
standard techniques.

Two other potential strategies did not result in high
specific yields of 5'-O-MeNPoc-2'-O-methylribonucleoside.
In the first, a less reactive MeNPoc derivative was
synthesized [...] to react exclusively with [...] analogue.
In the second, an organotin protection scheme was used.
[...] was reacted with the 2'-O-[...], followed by reaction
with MeNPoc [...] 5'-O-MeNPoc and 3'-O-MeNPoc 2'-O-
methylribonucleoside analogues were obtained.

Example 5

Hybridization to mixed-sequence oligodeoxynucleotide
probes substituted with 2-amino-2'-deoxyadenosine (D)

To test the effect of a 2-amino-2'-deoxyadenosine (D)
substitution in a heterogeneous probe sequence, two 4x4

Example 6

Hybridization to a dA-homopolymer oligodeoxynucleotide
probe substituted with 2-amino-2'-deoxyadenosine (D)

The following experiment was performed to compare the
hybridization of 2'-deoxyadenosine containing homopolymer
arrays with 2-amino-2'-deoxyadenosine homopolymer
arrays. The experiment was performed on two 11-mer
oligodeoxynucleotide probe containing arrays. Two 11-mer
oligodeoxynucleotide probe sequences were synthesized on
[...] 5'-O-MeNPOC-protected nucleoside phosphoramidites
and standard VLSIPS™ methodology.

The sequence of the first probe was: (HEG)-(3')-d
(AAAAANAAAAA)-(5') (SEQ ID NO:19); where HEG =
hexaethyleneglycol linker, and N is either A, G, C or T. The
second probe was the same, except that dA was replaced by
2-amino-2'-deoxyadenosine (D). The chip was hybridized
with a 5'-fluorescein-labeled oligodeoxynucleotide target
(5')-Fl-d(TTTTTGTTTTT)-(3') (SEQ ID NO:20), which
[...] where N=C. Hybridization conditions were 10 nM target
in 5x SSPE buffer at 22° C. [...] the chip was mounted on
the flowcell of a scanning laser [...] fluorescence
microscope, rinsed briefly with 5x SSPE buffer at 22° C.
(low stringency), and a surface fluorescence image was
obtained. Hybridization to the chip was continued for
another 5 hours, and a surface fluorescence image was
acquired again. Finally, the chip was washed briefly with
0.5x SSPE (high-stringency), then with 5x SSPE, and
re-scanned.

The relative efficiency of hybridization of the target to the
complementary and single-base mismatched probes was
determined by comparing the average bound surface
fluorescence [...] that substituting 2'-deoxyadenosine with
2-amino-2'-deoxyadenosine in a d(A)n homopolymer probe
sequence results in a significant enhancement in specific
hybridization to a complementary oligodeoxynucleotide
sequence.

Example 7

Hybridization to alternating A-T oligodeoxynucleotide
probes substituted with 5-propynyl-2'-deoxyuridine (P) and
2-amino-2'-deoxyadenosine (D)

Commercially available 5'-DMT-protected
2'-deoxynucleoside/nucleoside-analog phosphoramidites
(Glen Research) were used to synthesize two decanucleotide
[...] initially modified with a terminal-MeNPOC-protected
hexaethyleneglycol linker. The substrate was exposed to light
through a mask to remove the protecting group from the
linker in a checkerboard pattern. The first probe sequence
was then synthesized in the exposed region using DMT-
phosphoramidites with acid-deprotection cycles and the
sequence was finally capped with (MeO)2PNiPr2/tetrazole
followed by oxidation. A second checkerboard exposure in
a different (previously unexposed) region of the chip was
then performed, and the second probe sequence was
synthesized by the same [...] "control" probe was:
(HEG)-(3')-CGCGCCGCGC-(5') (SEQ ID NO:21); and the
sequence of the second probe was [...]

[...] (SEQ ID NO:22)
[...]
3. (HEG)-[...]
4. (HEG)-(3')-d(DPDPDDPDPD)-(5') (SEQ ID NO:25)

where HEG = hexaethyleneglycol linker, A = 2'-
deoxyadenosine, T = thymidine, D = 2-amino-2'-[...]
were 10 nM target in 5x SSPE buffer at 22° C. with gentle
shaking. After 3 hours, the chip was mounted on the flowcell
of a scanning laser confocal fluorescence microscope, rinsed
briefly with 5x SSPE buffer at 22° C., and then a surface
fluorescence image was obtained. Hybridization to the chip
was continued overnight (total hybridization time = 20 hr),
and a surface fluorescence image was acquired again.

The relative efficiency of hybridization of the target to the
A/T and substituted A/T probes was determined by comparing
the average surface fluorescence intensity bound to those
parts of the chip containing the A/T or substituted probe to
the fluorescence intensity bound to the G/C control probe
sequence. The [...] while the D- & P-substituted A/T probe
bound nearly as much (90%) as the G/C-probe. Moreover,
the kinetics of hybridization are such that, at early times, the
amount of target bound to the substituted A/T probes
exceeds that which is bound to the all-G/C probe.

Example 8

Hybridization to oligodeoxynucleotide probes substituted
with 7-deaza-2'-deoxyguanosine (ddG) and 2'-deoxyinosine
(dI)

A 16x[...] phosphoramidites, including the analogs ddG
and dI. The array was comprised of the set of probes
represented by the following [...]
G T T G1 G2 G3 G4 G5 C G G G T-(5') (SEQ ID NO:28)
where underlined bases are fixed, and the five internal
deoxyguanosines (G1-5) are substituted with G, ddG, dI, and
T in all possible (1024 total) combinations. A complementary
oligonucleotide target, labeled with fluorescein at the
[...] was hybridized to the array.
The hybridization conditions were: 5 nM target in 6x SSPE
buffer at 22° C. with shaking. After 30 minutes, the chip was
mounted on [...] Affymetrix scanning laser [...] with 0.5x
[...] was acquired.

The "efficiency" of target hybridization to each probe in
the array is proportional to the bound surface fluorescence
intensity in the region of the chip where the probe was
synthesized. The relative values for a subset of probes (those
containing dG→ddG and dG→dI substitutions only) are
[...] in FIG. 6. Substitution of guanosine with
7-deazaguanosine within the internal run of five G's results
in a significant enhancement in the fluorescence signal
intensity which measures hybridization. Deoxyinosine
substitutions also enhance hybridization to the probe, but to a
[...] extent. In this example, the best overall enhancement
[...] dG is substituted with
7-deaza-dG, with the substitutions distributed evenly
throughout the run (i.e., alternating dG/deaza-dG).

Example 9

Synthesis of 5'-MeNPOC-2'-deoxyinosine-3'-(N,N-
diisopropyl-2-cyanoethyl)phosphoramidite

[...] in 20 ml dry DCM was then added dropwise with
stirring over [...] minutes. After 60 minutes, the cold bath was
removed, and the solution was allowed to stir overnight at
room temperature. Pyridine and DCM were removed by
evaporation, 500 ml of ethyl acetate was added, and the
[...] was washed twice with water and then with brine
(200 ml each). The aqueous washes were combined and
back-extracted twice with ethyl acetate, and then all of the
[...] combined, dried with Na2SO4, and
[...] under vacuum. The product was recrystallized
[...] 5'-O-MeNPOC-2'-deoxyinosine as a yellow solid
(99% purity, according to 1H-NMR and HPLC analysis).

[...] (1.66 ml; 5.5 mmole) and 0.47 g (2.7 mmole) of
diisopropylammonium tetrazolide according to the published
procedure of Barone, et al. (Nucleic Acids Res. (1984)
12, 4051-61). The crude phosphoramidite was purified by
flash chromatography on silica gel (90:8:2 DCM-MeOH-
Et3N), co-evaporated twice with anhydrous acetonitrile and
[...] under vacuum for ~24 hours to obtain 2.8 g (80%) of
[...] yellow solid (98% purity as determined
by 1H/31P-NMR and HPLC).

Example 10

Synthesis of 5'-MeNPOC-7-deaza-2'-deoxy(N2-
isobutyryl)-guanosine-3'-(N,N-diisopropyl-2-cyanoethyl)
phosphoramidite.

The protected nucleoside 7-deaza-2'-deoxy(N2-
isobutyryl)guanosine (1.0 g, 3 mmole; Chemgenes Corp.,
Waltham, Mass.) was dried by co-evaporating three times
with 5 ml anhydrous pyridine and dissolved in 5 ml of dry
pyridine-DCM (75:25 by vol.). The solution was cooled to
-45° C. (dry ice/CH3CN) under argon, and a solution of [...]
g (3.3 mmole) MeNPOC-Cl in 2 ml dry DCM was then
added dropwise with stirring. After 30 minutes, the cold bath
was removed, and the solution allowed to [...]
crude material was purified by flash chromatography on
silica gel (2.5%-5% MeOH in DCM) to yield 1.5 g (88%
yield) 5'-MeNPOC-7-deaza-2'-deoxy(N2-isobutyryl)
guanosine as a yellow foam. The product was 98% pure
according to 1H-NMR and HPLC analysis.

The MeNPOC-nucleoside (1.25 g, 2.2 mmole) was
phosphitylated according to the published procedure of Barone,
et al. (Nucleic Acids Res. (1984) 12, 4051-61). The crude
product was purified by flash chromatography on silica gel
(60:35:5 hexane-ethyl acetate-Et3N), co-evaporated twice
with anhydrous acetonitrile and dried under vacuum for ~24
hours to obtain 1.3 g (75%) of the pure product as a yellow
solid (98% purity as determined by 1H/31P-NMR and
HPLC).

Example 11

Synthesis of 5'-MeNPOC-2,6-bis(phenoxyacetyl)-2,6-
diaminopurine-2'-deoxyriboside-3'-(N,N-diisopropyl-2-
cyanoethyl)phosphoramidite.

The protected nucleoside 2,6-bis(phenoxyacetyl)-2,6-
diaminopurine-2'-deoxyriboside (8 mmole, 4.2 g) was dried
by coevaporating twice from anhydrous pyridine, dissolved
in 2:1 pyridine/DCM (17.6 ml) and then cooled to -40° C.
MeNPOC-chloride (8 mmole, 2.18 g) was dissolved in
DCM (6.6 ml) and added to the reaction mixture dropwise.
The reaction was allowed to stir overnight with slow warming
to room temperature. After the overnight stirring, another 2
mmole (0.6 g) in DCM (1.6 ml) was added to the reaction
at -40° C. and stirred for an additional 6 hours or until no
unreacted nucleoside was present. The reaction mixture was
evaporated to dryness, and the residue was dissolved in ethyl
acetate and washed with water twice, followed by a wash
with saturated sodium chloride. The organic layer was dried
with MgSO4, and evaporated to a yellow solid which was
purified by flash chromatography in DCM employing a
methanol gradient to elute the desired product in 51% yield.

The 5'-MeNPOC-nucleoside (4.5 mmole, 3.5 g) was
phosphitylated according to the published procedure of
Barone, et al. (Nucleic Acids Res. (1984) 12, 4051-61). The
crude product was purified by flash chromatography on
silica gel (99:0.5:0.5 DCM-MeOH-Et3N). The pooled
fractions were evaporated to an oil, redissolved in a minimum
amount of DCM, precipitated by the addition of 800 ml ice
cold hexane, filtered, and then dried under vacuum for ~24
hours. Overall yield was 56%, at greater than 96% purity by
HPLC and 31P-NMR.

Example 12

5'-O-MeNPOC-protected phosphoramidites for
incorporating 7-deaza-2'-deoxyguanosine and 2'-deoxyinosine
into VLSIPS™ Oligonucleotide Arrays

VLSIPS oligonucleotide probe arrays in which all or a
subset of all guanosine residues are substituted with 7-deaza-
2'-deoxyguanosine and/or 2'-deoxyinosine are highly
desirable. This is because guanine-rich regions of nucleic acids
associate to form multi-stranded structures. For example,
short tracts of G residues in RNA and DNA commonly
associate to form tetrameric structures (Zimmermann et al.,
(1975) J. Mol. Biol. 92: 181; Kim, J. (1991) Nature 351:
[...] 335: [...] and Sunquist et al. [...]). [...] problem [...]
is that such structures may [...] hybridization between
complementary [...]. However, by substituting [...] G rich
nucleic acid sequences,
particularly at one or more positions within a run of G
residues, the tendency for such probes to form higher-order
structures is suppressed, while maintaining essentially the
same affinity and sequence specificity in double-stranded
structures. This has been exploited in order to reduce band
compression in sequencing gels (Mizusawa, et al. (1986)
N.A.R. 14: 1319) to improve target hybridization to G-rich
probe sequences in VLSIPS arrays. Similar results are
achieved using inosine (see also, Sanger et al. (1977)
P.N.A.S. 74: 5463).

For facile incorporation of 7-deaza-2'-deoxyguanosine
and 2'-deoxyinosine into oligonucleotide arrays using
VLSIPS™ methods, a nucleoside phosphoramidite
comprising the analogue base which has a 5'-O-MeNPOC-
protecting group is constructed. This building block was
prepared from commercially available nucleosides
according to Scheme I. These amidites pass the usual tests for
coupling efficiency and photolysis rate.




Scheme I

[Reaction scheme not reproduced. Legible reagent labels:
MeNPOC-Cl/pyridine, CH2Cl2 (first step);
(iPr2N)2POCE/iPr2NH/[...], CH2Cl2 (second step). Legible
structure labels: NH-ibu, HO, MeNPOC-O, CEP.]



Although the foregoing invention has been described in 
some detail by way of illustration and example for purposes 
of clarity of understanding, modifications can be made 
thereto without departing from the spirit or scope of the 
appended claims. 



All publications and patent applications cited in this
application are herein incorporated by reference for all
purposes as if each individual publication or patent
application were specifically and individually indicated to be
incorporated by reference.



SEQUENCE LISTING 

(1) GENERAL INFORMATION: 

(iii) NUMBER OF SEQUENCES: 29

(2) INFORMATION FOR SEQ ID NO : 1 : 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:l: 

10 

AAGATGCTAC 

(2) INFORMATION FOR SEQ ID NO:2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2: 

11 

AAAAANAAAA A 

(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:3: 

10 

ATATAATATA 

(2) INFORMATION FOR SEQ ID NO:4: 

(i) SEQUENCE CHARACTERISTICS : 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:4:

CGCGCCGCGC



(2) INFORMATION FOR SEQ ID NO:5:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 15 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 6..10
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = guanosine (G),
2',3'-dideoxyguanine (ddG),
2'-deoxyinosine (dI) or thymine (T)"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:5:

TGGGCNNNNN TTGTA
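
The entry for SEQ ID NO:5 has five N positions (6..10), each with four alternatives, so it stands for 4^5 = 1024 distinct probes, the same count quoted for the combinatorial array in the examples. A minimal sketch of that enumeration follows; the names `TEMPLATE`, `ALTERNATIVES` and `enumerate_probes` are assumptions introduced here, and each probe is represented as a list of residue labels because ddG and dI are not single letters.

```python
from itertools import product

# SEQ ID NO:5 as listed, with the spacing removed.
TEMPLATE = "TGGGCNNNNNTTGTA"
# Alternatives for N as given in the /note of the entry.
ALTERNATIVES = ("G", "ddG", "dI", "T")


def enumerate_probes(template=TEMPLATE, alternatives=ALTERNATIVES):
    """Yield every probe as a list of residue labels, one per position."""
    n_sites = template.count("N")
    for combo in product(alternatives, repeat=n_sites):
        choice = iter(combo)
        # Fixed positions keep their letter; each N takes the next choice.
        yield [next(choice) if base == "N" else base for base in template]


probes = list(enumerate_probes())

if __name__ == "__main__":
    print(len(probes), "probes")
    print("-".join(probes[0]))
```

Keeping residues as labels rather than characters avoids inventing one-letter codes for the analogue bases.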



(2) INFORMATION FOR SEQ ID NO:6:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..20
(D) OTHER INFORMATION: /note= "Target DNA sequence"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CTGAACGGTA GCATCTTGAC



(2) INFORMATION FOR SEQ ID NO:7:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:
(A) NAME/KEY: -
(D) OTHER INFORMATION: /note= "Complementary DNA sequence"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GTCAAGATGC TACCGTTCAG
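
Because SEQ ID NO:7 is described as the complementary DNA sequence, a quick consistency check is to confirm that it is the reverse complement of SEQ ID NO:6, and that the RNA target (SEQ ID NO:8) is SEQ ID NO:6 with U in place of T. The sketch below is illustrative; the helper name `reverse_complement` is an assumption introduced here.

```python
# Sequences transcribed from the listing, 5'->3', spacing removed.
SEQ_ID_6 = "CTGAACGGTAGCATCTTGAC"   # target DNA
SEQ_ID_7 = "GTCAAGATGCTACCGTTCAG"   # complementary DNA
SEQ_ID_8 = "CUGAACGGUAGCAUCUUGAC"   # target RNA

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement(SEQ_ID_6) == SEQ_ID_7)
    print(SEQ_ID_6.replace("T", "U") == SEQ_ID_8)
```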



(2) INFORMATION FOR SEQ ID NO:8:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: RNA

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..20
(D) OTHER INFORMATION: /note= "Target RNA sequence"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:8:

CUGAACGGUA GCAUCUUGAC



(2) INFORMATION FOR SEQ ID NO:9:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: other nucleic acid
(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 1
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 2
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 3
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 4
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 5
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 6
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 7
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 8
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 9
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 10
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 11
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 12
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 13
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 14
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 15
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 16
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 17
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 18
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 19
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 20
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..20
(D) OTHER INFORMATION: /note= "Target 2'-O-methyl
oligonucleotide sequence"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:9:

NNNNNNNNNN NNNNNNNNNN
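
Since SEQ ID NO:9 is listed only as a run of N, the underlying base sequence has to be read from the per-position modified_base features. The sketch below decodes them under stated assumptions: the list `FEATURES` is transcribed by hand from the twenty FEATURE blocks of this entry, the codes cm, um and gm are taken to denote the 2'-O-methyl forms of C, U and G, and `OTHER` is mapped to A only because every OTHER note in this entry reads 2'-O-methyladenosine. The names `CODE_TO_BASE` and `decode` are introduced here for illustration.

```python
# /mod_base value at positions 1..20 of SEQ ID NO:9, in order.
FEATURES = [
    "cm", "um", "gm", "OTHER", "OTHER",
    "cm", "gm", "gm", "um", "OTHER",
    "gm", "cm", "OTHER", "um", "cm",
    "um", "um", "gm", "OTHER", "cm",
]

# Mapping valid for this entry only: OTHER is annotated as
# 2'-O-methyladenosine at every position where it appears.
CODE_TO_BASE = {"cm": "C", "um": "U", "gm": "G", "OTHER": "A"}


def decode(features):
    """Collapse per-position modification codes to the parent base letters."""
    return "".join(CODE_TO_BASE[code] for code in features)


if __name__ == "__main__":
    print(decode(FEATURES))
```

The decoded string can then be compared with the RNA target of SEQ ID NO:8, which the examples give as the same base sequence.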

(2) INFORMATION FOR SEQ ID NO:10:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 8 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..8
(D) OTHER INFORMATION: /note= "Matching DNA oligonucleotide
probe"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:10:

CTTGCCAT

(2) INFORMATION FOR SEQ ID NO:11:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 8 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: other nucleic acid
(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 1
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 2
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 3
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 4
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 5
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 6
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 7
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 8
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..8
(D) OTHER INFORMATION: /note= "Matching 2'-O-methyl
oligonucleotide analogue probe"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:11:

NNNNNNNN



(2) INFORMATION FOR SEQ ID NO:12:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 8 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..8
(D) OTHER INFORMATION: /note= "DNA oligonucleotide probe
with 1 base mismatch"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:12:

CTTGCTAT

(2) INFORMATION FOR SEQ ID NO:13:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 8 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: other nucleic acid
(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 1
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 2
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 3
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 4
(D) OTHER INFORMATION: /mod_base= gm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 5
(D) OTHER INFORMATION: /mod_base= cm

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 6
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 7
(D) OTHER INFORMATION: /mod_base= OTHER
/note= "2'-O-methyladenosine"

(ix) FEATURE:
(A) NAME/KEY: modified_base
(B) LOCATION: 8
(D) OTHER INFORMATION: /mod_base= um

(ix) FEATURE:
(A) NAME/KEY: -
(B) LOCATION: 1..8
(D) OTHER INFORMATION: /note= "2'-O-methyl oligonucleotide
analogue probe with 1 base mismatch"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:13:

NNNNNNNN



(2) INFORMATION FOR SEQ ID NO: 14: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = cytosine covalently
modified at the 3' phosphate group with
a hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:14:
AAGATGNTAN 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:15:
AAGATNCTAN 



(2) INFORMATION FOR SEQ ID NO: 16:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:16:
AAGANGCTAN 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE : nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:17:

AAGNTGCTAN



(2) INFORMATION FOR SEQ ID NO:18:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 20 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 5' phosphate group with a
fluorescein molecule"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18: 

NTGAACGGTA GCATCTTGAC 



(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 11

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = adenine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 

AAAAANAAAA N 



(2) INFORMATION FOR SEQ ID NO:20: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = thymine covalently modified
at the 5' phosphate group with a
fluorescein molecule"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:20:

NTTTTGTTTT T



(2) INFORMATION FOR SEQ ID NO: 21:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:21:

CGCGCCGCGN 10 



(2) INFORMATION FOR SEQ ID NO: 22: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside
analogue decanucleotide probe"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 3

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 5

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 6

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 8

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2'-deoxyadenosine covalently
modified at the 3' phosphate group with
a hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22:

NTNTNNTNTN



(2) INFORMATION FOR SEQ ID NO:23:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: other nucleic acid

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside
analogue decanucleotide probe"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 2

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 3

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 4

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 5

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 6

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 7

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 8

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 9

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2'-deoxyadenosine covalently
modified at the 3' phosphate group with
a hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:23:

NNNNNNNNNN



(2) INFORMATION FOR SEQ ID NO:24:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside
analogue decanucleotide probe"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 3

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 5

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 6

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine
covalently modified at the 3'
phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24:

NTNTNNTNTN 10



(2) INFORMATION FOR SEQ ID NO: 25:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: other nucleic acid

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside
analogue decanucleotide probe"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 2

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 3

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 4

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 5

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 6

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 7

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 8

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = 2-amino-2'-deoxyadenosine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 9

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 5-propynyl-2'-deoxyuridine"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = 2-amino-2'-deoxyadenosine
covalently modified at the 3'
phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25: 

NNNNNNNNNN 10 



(2) INFORMATION FOR SEQ ID NO: 26: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = thymine covalently modified
at the 5' hydroxyl group with a
fluorescein molecule"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 10

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = thymine covalently modified
at the 3' phosphate group with a
hexaethyleneglycol (HEG) linker which is
covalently bound to the 5' phosphate
group of the 5' guanine (N in pos. 1) of
SEQ ID NO: 27"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:26: 

NATATTATAN 10 



(2) INFORMATION FOR SEQ ID NO:27:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 10 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = guanine covalently modified
at the 5' phosphate group with a
hexaethyleneglycol (HEG) linker which is
covalently bound to the 3' phosphate
group of the 3' thymine (N in pos. 10)
of SEQ ID NO:26"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:27:

NCGCGGCGCG 10 



(2) INFORMATION FOR SEQ ID NO: 28:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base

(B) LOCATION: 6..10

(D) OTHER INFORMATION: /mod_base= OTHER
/note= "N = guanine (G),
2',3'-dideoxyguanine (ddG),
2'-deoxyinosine (dI) or thymine (T)"

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 15

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 5' phosphate group with a
hexaethyleneglycol (HEG) linker"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28: 

TGGGCNNNNN TTGTN 15 



(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 21 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA

(ix) FEATURE:

(A) NAME/KEY: modified_base

(B) LOCATION: 1

(D) OTHER INFORMATION: /mod_base= OTHER

/note= "N = cytosine covalently modified
at the 5' phosphate group with a
fluorescein molecule"

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 29:

NAATACAACC CCCGCCCATC C 21 



What is claimed is: 

1. A composition for analyzing interactions between
oligonucleotide targets and oligonucleotide probes comprising
an array of a plurality of oligonucleotide analogue probes
having different sequences, wherein said oligonucleotide
analogue probes are coupled to a solid substrate at known
locations and wherein said plurality of oligonucleotide
analogue probes are selected to bind to complementary
oligonucleotide targets with a similar hybridization stability
across the array.

2. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide targets.

3. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

4. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

5. The composition of claim 2, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

6. The composition of claim 2, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

7. The composition of claims 1-5 or 6, wherein said solid 
substrate is selected from the group consisting of silica,
polymeric materials, glass, beads, chips, and slides. 

8. The composition of claims 1-5 or 6, wherein said 
composition comprises an array of oligonucleotide analogue 
probes 5 to 20 nucleotides in length. 

9. The composition of claims 1-5 or 6, wherein said array
of oligonucleotide analogue probes comprises a nucleoside 
analogue with the formula 




wherein: 

the nucleoside analogue is not a naturally occurring DNA 
or RNA nucleoside; 

R1 is selected from the group consisting of hydrogen,
methyl, hydroxyl, alkoxy, alkylthio, halogen, cyano,
and azido;

R2 is selected from the group consisting of hydrogen,
methyl, hydroxyl, alkoxy, alkylthio, halogen, cyano,
and azido;

Y is a heterocyclic moiety;

and wherein said nucleoside analogue is incorporated into
the oligonucleotide analogue by attachment to a 3'
hydroxyl of the nucleoside analogue, to a 5' hydroxyl of
the nucleoside analogue, or both the 3' nucleoside and
the 5' hydroxyl of the nucleoside analogue.

10. The composition of claims 1-5 or 6, wherein said
array of oligonucleotide analogue probes comprises a nucleoside
analogue with the formula




wherein: 

the nucleoside analogue is not a naturally occurring DNA 
or RNA nucleoside; 

R1 is selected from the group consisting of hydrogen,
hydroxyl, methyl, methoxy, ethoxy, propoxy, allyloxy,
propargyloxy, Fluorine, Chlorine, and Bromine;

R2 is selected from the group consisting of hydrogen,
hydroxyl, methyl, methoxy, ethoxy, propoxy, allyloxy,
propargyloxy, Fluorine, Chlorine, and Bromine; and

Y is a base selected from the group consisting of
purines, purine analogues, pyrimidines, pyrimidine
analogues, 3-nitropyrrole and 5-nitroindole;

and wherein said nucleoside analogue is incorporated into
the oligonucleotide analogue by attachment to a 3'
hydroxyl of the nucleoside analogue, to a 5' hydroxyl of
the nucleoside analogue, or both the 3' nucleoside and
the 5' hydroxyl of the nucleoside analogue.

11. The composition of claims 1-5 or 6, wherein each
probe of said plurality of oligonucleotide analogue probes
has at least one oligonucleotide analogue, and wherein at
least one of said oligonucleotide analogues comprises a
peptide nucleic acid.

12. The composition of claims 1-5 or 6, wherein at least
one of said plurality of oligonucleotide analogue probes said
array of oligonucleotide analogue probes is resistant to
RNAase A.

13. The composition of claims 1-5 or 6, wherein said
solid substrate is attached to over 1000 different
oligonucleotide analogue probes.

14. The composition of claims 1-5 or 6, wherein each
probe of said plurality of oligonucleotide analogue probes
has at least one oligonucleotide analogue, and wherein at
least one of said oligonucleotide analogues comprises
2'-O-methyl nucleotides.

15. The composition of claims 1-5 or 6, wherein said
array of oligonucleotide analogue probes and said solid
substrate comprises a plurality of different oligonucleotide
analogue probes, each oligonucleotide analogue probes
having the formula:

Y-L1-X1-L2-X2

wherein,

Y is a solid substrate;

X1 and X2 are complementary oligonucleotides containing
at least one nucleotide analogue;

L1 is a spacer,

L2 is a linking group having sufficient length such that X1
and X2 form a double-stranded oligonucleotide.

16. The composition of claim 15, wherein said composition
comprises a library of unimolecular double-stranded
oligonucleotide analogue probes.

17. The composition of claims 1-5 or 6, wherein said
array of oligonucleotide analogue probes comprises a
conformationally restricted array of oligonucleotide analogue
probes with the formula:

-X11-Z-X12

wherein X11 and X12 are complementary oligonucleotides
or oligonucleotide analogues and Z is a presented
moiety.

18. The composition of claims 1-5 or 6, wherein each
probe of said plurality of oligonucleotide analogue probes
has at least one oligonucleotide analogue, and wherein at
least one of said oligonucleotide analogues comprises a
nucleotide with a base selected from the group of bases
consisting of 5-propynyluracil, 5-propynylcytosine,
2-aminoadenine, 7-deazaguanine, 2-aminopurine,
8-aza-7-deazaguanine, 1H-purine, and hypoxanthine.

19. The composition of claims 1-5 or 6, wherein said
plurality of oligonucleotide analogue probes are coupled to
said solid substrate by light-directed chemical coupling.

20. The composition of claim 19, wherein said solid
substrate is derivitized with a silane reagent prior to
synthesis of said plurality of oligonucleotide analogue probes.

21. The composition of claims 1-5 or 6, wherein said
plurality of oligonucleotide analogue probes are coupled to
said solid substrate by flowing oligonucleotide analogue
reagents over known locations of the solid substrate.

22. The composition of claim 21, wherein said solid
substrate is derivitized with a silane reagent prior to
synthesis of said plurality of oligonucleotide analogue probes.

23. The composition of claims 1-5 or 6, wherein at least
one of plurality of said oligonucleotide analogue probes
forms a first duplex with a target oligonucleotide sequence,
wherein said oligonucleotide analogue probe has a
corresponding oligonucleotide sequence that forms a second
duplex with said target oligonucleotide sequence, wherein
said second duplex is rich in A-T or G-C nucleotide pairs,
and wherein said oligonucleotide analogue probe has at least
one nucleotide analogue in place of an A, T, G, or C
nucleotide of said corresponding oligonucleotide sequence
at a position within said oligonucleotide analogue probe
such that said first duplex has an increased hybridization
stability than said second duplex.

24. The composition of claim 23, wherein said
oligonucleotide analogue probe contains fewer bases than said
corresponding oligonucleotide sequence.

25. The composition of claims 1-5 or 6, wherein said
oligonucleotide analogue probe forms a first duplex with a
target oligonucleotide sequence, wherein said
oligonucleotide analogue probe has a corresponding oligonucleotide
sequence that forms a second duplex with said target
polynucleotide sequence, and wherein said oligonucleotide
analogue probe is shorter than said corresponding
polynucleotide sequence.

26. A composition for analyzing the interaction between
an oligonucleotide target and an oligonucleotide probe
comprising an array of a plurality of oligonucleotide probes
having different sequences hybridized to complementary
oligonucleotide analogue targets, wherein said
oligonucleotide analogue targets bind to complementary
oligonucleotide probes with a similar hybridization stability across the
array.

27. The composition of claim 26, wherein at least one of
said oligonucleotide analogue target is selected to maintain
hybridization specificity or mismatch discrimination with
said complementary oligonucleotide probes.

28. The composition of claim 26, wherein at least one of
said oligonucleotide analogue targets has increased the
thermal stability between said oligonucleotide analogue
target and said complementary oligonucleotide probe as
compared to an oligonucleotide target that is the perfect
complement to the complementary oligonucleotide probe
with which said oligonucleotide analogue target anneals.

29. The composition of claim 26, wherein at least one of
said oligonucleotide analogue targets has decreased the
thermal stability between said oligonucleotide analogue
target and said complementary oligonucleotide probe as
compared to an oligonucleotide target that is the perfect
complement to the complementary oligonucleotide probe
with which said oligonucleotide analogue target anneals.

30. The composition of claim 27, wherein at least one of
said oligonucleotide analogue targets has increased the
thermal stability between said oligonucleotide analogue
target and said complementary oligonucleotide probe as
compared to an oligonucleotide target that is the perfect
complement to the complementary oligonucleotide probe
with which said oligonucleotide analogue target anneals.

31. The composition of claim 27, wherein at least one of
said oligonucleotide analogue targets has decreased the
thermal stability between said oligonucleotide analogue
target and said complementary oligonucleotide probe as
compared to an oligonucleotide target that is the perfect
complement to the complementary oligonucleotide probe
with which said oligonucleotide analogue target anneals.

32. The composition of claims 26-30 or 31, wherein the
oligonucleotide analogue target is a PCR amplicon.

33. The composition of claims 26-30 or 31, wherein at
least one of said plurality of oligonucleotide probes
comprise at least one oligonucleotide analogue.

34. The composition of claims 26-30 or 31, wherein at
least one target oligonucleotide analogue acid is an RNA
nucleic acid.

35. A method analyzing interactions between an oligo- 
nucleotide target and an oligonucleotide probe, comprising 
the steps of: 

(a). synthesizing an oligonucleotide analogue array
comprising a plurality of oligonucleotide analogue probes
having different sequences, wherein said oligonucleotide
analogue probes are coupled to a solid substrate
at known locations, said solid substrate having a surface;

(b). exposing said oligonucleotide analogue probe array to
a plurality of oligonucleotide targets under hybridiza- 
tion conditions such that said plurality of oligonucle- 
otide analogue probes bind to complementary oligo- 
nucleotide targets with a similar hybridization stability 
across the array; and 

(c). determining whether an oligonucleotide analogue
probe of said oligonucleotide analogue probe array 
binds to at least one of said target nucleic acids. 

36. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes is selected 
to maintain hybridization specificity or mismatch discrimi- 
nation with said complementary oligonucleotide targets. 

37. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes has 
increased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

38. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes has 
decreased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

39. The method in accordance of claim 36, wherein at 
least one of said oligonucleotide analogue probes has 
increased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

40. The method in accordance of claim 36, wherein at
least one of said oligonucleotide analogue probes has 
decreased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

41. The method of claims 35-39 or 40, wherein said 
oligonucleotide target is selected from the group comprising 
genomic DNA, cDNA, unspliced RNA, mRNA, and rRNA. 

42. The method of claims 35-39 or 40, wherein said target 
nucleic acid is amplified prior to said hybridization step. 

43. The method of claims 35-39 or 40, wherein said 
plurality of oligonucleotide analogue probes is synthesized 
on said solid support by light-directed synthesis. 

44. The method of claims 35-39 or 40, wherein said
plurality of said oligonucleotide analogue probes is
synthesized on said solid support by causing oligonucleotide
analogue synthetic reagents to flow over known locations of
said solid support.

45. The method of claims 35-39 or 40, wherein said step 
(a) comprises the steps of:

i). forming a plurality of channels adjacent to the surface
of said substrate;

ii). placing selected reagents in said channels to
synthesize oligonucleotide analogue probes at known
locations; and

iii). repeating steps i). and ii), thereby forming an array of
oligonucleotide analogue probes having different
sequences at known locations on said substrate.

46. The method of claims 35-39 or 40, wherein said solid 
substrate is selected from the group consisting of beads, 
slides, and chips. 

47. The method of claims 35-39 or 40, wherein said solid 
substrate is comprised of materials selected from the group 
consisting of silica, polymers and glass. 

48. The method of claims 35-39 or 40, wherein the 
oligonucleotide analogue probes of said array are synthe- 
sized using photoremovable protecting groups. 

49. The method of claims 35-39 or 40, further comprising 
selectively incorporating MeNPoc onto the 3' or 5' hydroxyl 
of at least one nucleoside analogue and selectively incorpo- 
rating said nucleoside analogue into at least one of said 
oligonucleotide analogue probes. 

50. The method of claims 35-39 or 40, wherein at least 
one of said oligonucleotide analogue probes is synthesized 
from phosphoramidite nucleoside reagents. 

51. A method of detecting an oligonucleotide target, 
comprising enzymatically copying an oligonucleotide target 
using at least one nucleotide analogue, thereby producing 
multiple oligonucleotide analogue targets, selecting said 
oligonucleotide analogue targets such that said oligonucle- 
otide analogue targets bind to the complementary oligo- 
nucleotide probes coupled to a solid surface at known 
locations of an array with a similar hybridization stability 
across the array, hybridizing the oligonucleotide analogue 
targets to complementary oligonucleotide probes, and 
detecting whether at least one of said oligonucleotide
analogue targets binds to said complementary oligonucleotide
acid probe. 

52. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide probes. 

53. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets has increased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

54. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets has decreased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

55. The method of claim 52, wherein at least one of said 
oligonucleotide analogue targets has increased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals.

56. The method of claim 52, wherein at least one of said
oligonucleotide analogue targets has decreased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

57. The method of claims 51-55 or 56, wherein the
oligonucleotide probe array comprises at least one oligo-
nucleotide analogue probe which is complementary to at
least one of said oligonucleotide analogue targets.

58. A method of making an array of oligonucleotide 
probes, comprising providing a plurality of oligonucleotide 
analogue probes having at least one oligonucleotide 
analogue, said oligonucleotide analogue probes having dif-
ferent sequences at known locations on an array, selecting
the oligonucleotide analogue probes to hybridize with 
complementary oligonucleotide target sequences under 
hybridization conditions such that said oligonucleotide ana-
logue probes bind to complementary oligonucleotide targets
with a similar hybridization stability, across the array. 

59. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide targets.

60. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to
the complementary oligonucleotide target with which said
oligonucleotide analogue probe anneals.

61. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

62. The method of claim 59, wherein at least one of said
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said
oligonucleotide analogue probe anneals. 

63. The method of claim 59, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to
an oligonucleotide probe that is the perfect complement to
the complementary oligonucleotide target with which said
oligonucleotide analogue probe anneals. 

64. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle- 
otide analogue into at least one of the oligonucleotide 
analogue probes of the array to reduce or prevent the 
formation of secondary structure in the oligonucleotide of 
the array. 

65. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle-
otide analogue into at least one of the oligonucleotide targets
to reduce or prevent the formation of secondary structure in 
the target polynucleotide sequence. 

66. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle- 
otide analogue into at least one of the oligonucleotide 
analogue probes of the array to create secondary structure in 
the oligonucleotide of the array. 

67. The method in accordance with claims 58-62, or 63, 
further comprising incorporating a base selected from the 
group consisting of 5-propynyluracil, 5-propynylcytosine, 
2-aminoadenine, 7-deazaguanine, 2-aminopurine, 8-aza-7-
deazaguanine, 1H-purine, and hypoxanthine into the oligo-
nucleotide analogue probes of the array.

68. The method of claim 67 further comprising selecting 
said at least one oligonucleotide analogue such that the
oligonucleotide analogue probe is a homopolymer. 

69. The method in accordance with claims 58-62, or 63, 
further comprising selecting said at least one oligonucleotide 
analogue from the group consisting essentially of oligo-
nucleotide analogues comprising 2'-O-methyl nucleotides
and oligonucleotides comprising a base selected from the
group of bases consisting of 5-propynyluracil,
5-propynylcytosine, 7-deazaguanine, 2-aminoadenine,
8-aza-7-deazaguanine, 1H-purine, and hypoxanthine.

70. The method in accordance with claims 58-62 or 63, 
further comprising selecting said at least one oligonucleotide 
analogue such that oligonucleotide analogue probes com- 
prises at least one peptide nucleic acid. 

71. The method in accordance with claims 58-62, or 63, 
further comprising selecting said at least one oligonucleotide 
analogue to increase image brightness when the oligonucle- 
otide target and the oligonucleotide analogue probe hybrid- 
ize in the presence of a fluorescent indicator, in comparison 
to an oligonucleotide probe without oligonucleotide analogs.

72. The method in accordance with claims 58-62, or 63, 
further comprising providing said plurality of oligonucle- 
otide analogue probes in an array with at least 1000 other 
oligonucleotide analogue probes. 

The present inventions relate to the synthesis and place-
ment of materials at known locations. In particular, one
embodiment of the inventions provides a method and asso- 
ciated apparatus for preparing diverse chemical sequences at 
known locations on a single substrate surface. The inven- 
tions may be applied, for example, in the field of preparation 
of oligomer, peptide, nucleic acid, oligosaccharide, 
phospholipid, polymer, or drug congener preparation, espe- 
cially to create sources of chemical diversity for use in 
screening for biological activity. 

The relationship between structure and activity of mol- 
ecules is a fundamental issue in the study of biological 
systems. Structure-activity relationships are important in 
understanding, for example, the function of enzymes, the 
ways in which cells communicate with each other, as well as
cellular control and feedback systems. 

Certain macromolecules are known to interact and bind to 
other molecules having a very specific three-dimensional 
spatial and electronic distribution. Any large molecule hav- 
ing such specificity can be considered a receptor, whether it
is an enzyme catalyzing hydrolysis of a metabolic 
intermediate, a cell-surface protein mediating membrane 
transport of ions, a glycoprotein serving to identify a par- 
ticular cell to its neighbors, an IgG-class antibody circulat- 
ing in the plasma, an oligonucleotide sequence of DNA in
the nucleus, or the like. The various molecules which 
receptors selectively bind are known as ligands. 

Many assays are available for measuring the binding 
affinity of known receptors and ligands, but the information 
which can be gained from such experiments is often limited
by the number and type of ligands which are available. 
Novel ligands are sometimes discovered by chance or by 
application of new techniques for the elucidation of molecu- 
lar structure, including x-ray crystallographic analysis and 
recombinant genetic techniques for proteins.

Small peptides are an exemplary system for exploring the 
relationship between structure and function in biology. A 
peptide is a sequence of amino acids. When the twenty 
naturally occurring amino acids are condensed into poly-
meric molecules they form a wide variety of three-
dimensional configurations, each resulting from a particular
amino acid sequence and solvent condition. The number of
possible pentapeptides of the 20 naturally occurring amino
acids, for example, is 20⁵ or 3.2 million different peptides.
The likelihood that molecules of this size might be useful in 
receptor-binding studies is supported by epitope analysis 
studies showing that some antibodies recognize sequences 
as short as a few amino acids with high specificity. 
Furthermore, the average molecular weight of amino acids 
puts small peptides in the size range of many currently 
useful pharmaceutical products. 
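
The pentapeptide figure quoted above (3.2 million) is simply the number of ordered choices of five residues from twenty. A minimal sketch of that arithmetic follows; the function name peptide_count is illustrative and does not come from the specification.

```python
# Count the distinct linear peptides of a given length that can be built
# from a fixed set of amino acids (each position chosen independently).
NATURAL_AMINO_ACIDS = 20


def peptide_count(length: int, alphabet_size: int = NATURAL_AMINO_ACIDS) -> int:
    """Number of ordered sequences of `length` residues from the alphabet."""
    return alphabet_size ** length


print(peptide_count(5))  # 3200000 possible pentapeptides
```

The same function shows how quickly the space grows: each additional residue multiplies the count by twenty.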

Pharmaceutical drug discovery is one type of research 
which relies on such a study of structure-activity relation- 
ships. In most cases, contemporary pharmaceutical research 
can be described as the process of discovering novel ligands 
with desirable patterns of specificity for biologically impor- 
tant receptors. Another example is research to discover new 
compounds for use in agriculture, such as pesticides and 
herbicides. 

Sometimes, the solution to a rational process of designing 
ligands is difficult or unyielding. Prior methods of preparing 
large numbers of different polymers have been painstakingly 
slow when used at a scale sufficient to permit effective 
rational or random screening. For example, the "Merrifield" 
method (J. Am. Chem. Soc. (1963) 85:2149-2154, which is
incorporated herein by reference for all purposes) has been 
used to synthesize peptides on a solid support. In the 
Merrifield method, an amino acid is covalently bonded to a 
support made of an insoluble polymer. Another amino acid 
with an alpha protected group is reacted with the covalently 
bonded amino acid to form a dipeptide. After washing, the 
protective group is removed and a third amino acid with an 
alpha protective group is added to the dipeptide. This 
process is continued until a peptide of a desired length and 
sequence is obtained. Using the Merrifield method, it is not 
economically practical to synthesize more than a handful of 
peptide sequences in a day. 

To synthesize larger numbers of polymer sequences, it has 
also been proposed to use a series of reaction vessels for 
polymer synthesis. For example, a tubular reactor system 
may be used to synthesize a linear polymer on a solid phase 
support by automated sequential addition of reagents. This 
method still does not enable the synthesis of a sufficiently 
large number of polymer sequences for effective economical 
screening. 

Methods of preparing a plurality of polymer sequences 
are also known in which a foraminous container encloses a 
known quantity of reactive particles, the particles being 
larger in size than foramina of the container. The containers 
may be selectively reacted with desired materials to synthe- 
size desired sequences of product molecules. As with other 
methods known in the art, this method cannot practically be 
used to synthesize a sufficient variety of polypeptides for 
effective screening. 

Other techniques have also been described. These meth- 
ods include the synthesis of peptides on 96 plastic pins 
which fit the format of standard microtiter plates.
Unfortunately, while these techniques have been somewhat 
useful, substantial problems remain. For example, these 
methods continue to be limited in the diversity of sequences 
which can be economically synthesized and screened.

From the above, it is seen that an improved method and 
apparatus for synthesizing a variety of chemical sequences 
at known locations is desired. 

SUMMARY OF THE INVENTION 

An improved method and apparatus for the preparation of 
a variety of polymers is disclosed. 

In one preferred embodiment, linker molecules are pro-
vided on a substrate. A terminal end of the linker molecules
is provided with a reactive functional group protected with 
a photoremovable protective group. Using lithographic 
methods, the photoremovable protective group is exposed to
light and removed from the linker molecules in first selected 
regions. The substrate is then washed or otherwise contacted 
with a first monomer that reacts with exposed functional 
groups on the linker molecules. In a preferred embodiment, 
the monomer is an amino acid containing a photoremovable
protective group at its amino or carboxy terminus and the 
linker molecule terminates in an amino or carboxy acid 
group bearing a photoremovable protective group. 

A second set of selected regions is, thereafter, exposed to 
light and the photoremovable protective group on the linker
molecule/protected amino acid is removed at the second set 
of regions. The substrate is then contacted with a second 
monomer containing a photoremovable protective group for
reaction with exposed functional groups. This process is 
repeated to selectively apply monomers until polymers of a
desired length and desired chemical sequence are obtained. 
Photolabile groups are then optionally removed and the 
sequence is, thereafter, optionally capped. Side chain pro- 
tective groups, if present, are also removed. 

By using the lithographic techniques disclosed herein, it 
is possible to direct light to relatively small and precisely 
known locations on the substrate. It is, therefore, possible to 
synthesize polymers of a known chemical sequence at 
known locations on the substrate. 

The resulting substrate will have a variety of uses 
including, for example, screening large numbers of poly- 
mers for biological activity. To screen for biological activity, 
the substrate is exposed to one or more receptors such as 
antibodies, whole cells, receptors on vesicles, lipids, or any
one of a variety of other receptors. The receptors are 
preferably labeled with, for example, a fluorescent marker, 
radioactive marker, or a labeled antibody reactive with the 
receptor. The location of the marker on the substrate is 
detected with, for example, photon detection or autoradio-
graphic techniques. Through knowledge of the sequence of
the material at the location where binding is detected, it is 
possible to quickly determine which sequence binds with the 
receptor and, therefore, the technique can be used to screen 
large numbers of peptides. Other possible applications of the
inventions herein include diagnostics in which various anti- 
bodies for particular receptors would be placed on a sub- 
strate and, for example, blood sera would be screened for 
immune deficiencies. Still further applications include, for 
example, selective "doping" of organic materials in semi-
conductor devices, and the like.

In connection with one aspect of the invention an 
improved reactor system for synthesizing polymers is also 
disclosed. The reactor system includes a substrate mount 
which engages a substrate around a periphery thereof. The
substrate mount provides for a reactor space between the 
substrate and the mount through or into which reaction fluids 
are pumped or flowed. A mask is placed on or focused on the 
substrate and illuminated so as to deprotect selected regions 
of the substrate in the reactor space. A monomer is pumped
through the reactor space or otherwise contacted with the 
substrate and reacts with the deprotected regions. By selec- 
tively deprotecting regions on the substrate and flowing 
predetermined monomers through the reactor space, desired 
polymers at known locations may be synthesized.

Improved detection apparatus and methods are also dis-
closed. The detection method and apparatus utilize a sub-
strate having a large variety of polymer sequences at known
locations on a surface thereof. The substrate is exposed to a 
fluorescently labeled receptor which binds to one or more of 
the polymer sequences. The substrate is placed in a micro- 
scope detection apparatus for identification of locations 
where binding takes place. The microscope detection appa- 
ratus includes a monochromatic or polychromatic light 
source for directing light at the substrate, means for detect- 
ing fluoresced light from the substrate, and means for 
determining a location of the fluoresced light. The means for 
detecting light fluoresced on the substrate may in some 
embodiments include a photon counter. The means for 
determining a location of the fluoresced light may include an 
x/y translation table for the substrate. Translation of the slide 
and data collection are recorded and managed by an appro- 
priately programmed digital computer. 
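
The data-handling side of the detection step described above can be sketched as follows: photon counts are recorded at each x/y stage position, positions above a cutoff are treated as binding sites, and each site is mapped back to the sequence known to occupy it. This is only a sketch under stated assumptions. The function names, the threshold, the grid layout, and the counts are all hypothetical illustrative values, not data or software from the specification.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def binding_locations(counts: List[List[int]], threshold: int) -> List[Position]:
    """Return (x, y) stage positions whose photon count exceeds the threshold."""
    return [
        (x, y)
        for y, row in enumerate(counts)
        for x, count in enumerate(row)
        if count > threshold
    ]


def sequences_bound(hits: List[Position], layout: Dict[Position, str]) -> List[str]:
    """Look up the sequence synthesized at each position where binding was seen."""
    return [layout[pos] for pos in hits if pos in layout]


# Invented example: a 2x2 scan in which only one region fluoresces strongly.
counts = [[12, 950], [8, 15]]
layout = {(0, 0): "seq-1", (1, 0): "seq-2", (0, 1): "seq-3", (1, 1): "seq-4"}

hits = binding_locations(counts, threshold=100)
print(sequences_bound(hits, layout))  # ['seq-2']
```

A fixed threshold is the simplest possible criterion; a real instrument would likely need background subtraction, which is outside this sketch.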

A further understanding of the nature and advantages of 
the inventions herein may be realized by reference to the 
remaining portions of the specification and the attached 
drawings. 

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 illustrates masking and irradiation of a substrate at 
a first location. The substrate is shown in cross-section; 

FIG. 2 illustrates the substrate after application of a 
monomer "A"; 

FIG. 3 illustrates irradiation of the substrate at a second 
location; 

FIG. 4 illustrates the substrate after application of mono- 
mer "B"; 

FIG. 5 illustrates irradiation of the "A" monomer; 

FIG. 6 illustrates the substrate after a second application 
of "B"; 

FIG. 7 illustrates a completed substrate; 

FIGS. 8A and 8B illustrate alternative embodiments of a 
reactor system for forming a plurality of polymers on a 
substrate; 

FIG. 9 illustrates a detection apparatus for locating fluo- 
rescent markers on the substrate; 

FIGS. 10A-10M illustrate the method as it is applied to 
the production of the trimers of monomers "A" and "B"; 

FIGS. 11A and 11B are fluorescence traces for standard 
fluorescent beads; 

FIGS. 12A and 12B are fluorescence curves for NVOC 
slides not exposed and exposed to light respectively; 

FIGS. 13A to 13D are fluorescence plots of slides exposed
through 100 μm, 50 μm, 20 μm, and 10 μm masks;

FIGS. 14A and 14B illustrate fluorescence of a slide with
the peptide YGGFL on selected regions of its surface which
has been exposed to labeled Herz antibody specific for this
sequence;

FIGS. 15A and 15B illustrate formation of and a fluores-
cence plot of a slide with a checkerboard pattern of YGGFL
and GGFL exposed to labeled Herz antibody. FIG. 15A
illustrates a 500x500 μm mask which has been focused on
the substrate according to FIG. 8A while FIG. 15B illustrates
a 50x50 μm mask placed in direct contact with the substrate
in accord with FIG. 8B;

FIG. 16 is a fluorescence plot of YGGFL and PGGFL 
synthesized in a 50 μm checkerboard pattern;

FIG. 17 is a fluorescence plot of YPGGFL and YGGFL
synthesized in a 50 μm checkerboard pattern;

FIGS. 18A and 18B illustrate the mapping of sixteen 
sequences synthesized on two different glass slides; 

FIG. 19 is a fluorescence plot of the slide illustrated in
FIG. 18A; and

FIG. 20 is a fluorescence plot of the slide illustrated in 
FIG. 18B.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

CONTENTS 






I. Glossary 

II. General 

III. Polymer Synthesis 

IV. Details of One Embodiment of a Reactor System 

V. Details of One Embodiment of a Fluorescent Detection
Device

VI. Determination of Relative Binding Strength of Recep- 
tors 

VII. Examples 

A. Slide Preparation 

B. Synthesis of Eight Trimers of "A" and "B" 

C. Synthesis of a Dimer of an Aminopropyl Group and 
a Fluorescent Group 

D. Demonstration of Signal Capability

E. Determination of the Number of Molecules Per Unit 
Area 

F. Removal of NVOC and Attachment of a Fluorescent 
Marker 

G. Use of a Mask in Removal of NVOC 

H. Attachment of YGGFL and Subsequent Exposure to 
Herz Antibody and Goat Antimouse 

I. Monomer-by-Monomer Formation of YGGFL and 
Subsequent Exposure to Labeled Antibody 

J. Monomer-by-Monomer Synthesis of YGGFL and
PGGFL

K. Monomer-by-Monomer Synthesis of YGGFL and
YPGGFL

L. Synthesis of an Array of Sixteen Different Amino 
Acid Sequences and Estimation of Relative Binding
Affinity to Herz Antibody 

VIII. Illustrative Alternative Embodiment 

IX. Conclusion 



I. Glossary



The following terms are intended to have the following 
general meanings as they are used herein: 

1. Complementary: Refers to the topological compatibility 
or matching together of interacting surfaces of a ligand 
molecule and its receptor. Thus, the receptor and its ligand
can be described as complementary, and furthermore, the
contact surface characteristics are complementary to each
other.

2. Epitope: The portion of an antigen molecule which is 
delineated by the area of interaction with the subclass of
receptors known as antibodies. 

3. Ligand: A ligand is a molecule that is recognized by a 
particular receptor. Examples of ligands that can be inves- 
tigated by this invention include, but are not restricted to, 
agonists and antagonists for cell membrane receptors,
toxins and venoms, viral epitopes, hormones (e.g., 
opiates, steroids, etc.), hormone receptors, peptides, 
enzymes, enzyme substrates, cofactors, drugs, lectins, 
sugars, oligonucleotides, nucleic acids, oligosaccharides, 
proteins, and monoclonal antibodies.

4. Monomer: A member of the set of small molecules which 
can be joined together to form a polymer. The set of
monomers includes but is not restricted to, for example,
the set of common L-amino acids, the set of D-amino
acids, the set of synthetic amino acids, the set of nucle-
otides and the set of pentoses and hexoses. As used herein,
monomers refers to any member of a basis set for syn-
thesis of a polymer. For example, dimers of L-amino acids
form a basis set of 400 monomers for synthesis of
polypeptides. Different basis sets of monomers may be 
used at successive steps in the synthesis of a polymer. 

5. Peptide: A polymer in which the monomers are alpha 
amino acids and which are joined together through amide 
bonds and alternatively referred to as a polypeptide. In the 
context of this specification it should be appreciated that 
the amino acids may be the L-optical isomer or the 
D-optical isomer. Peptides are more than two amino acid
monomers long, and often more than 20 amino acid 
monomers long. Standard abbreviations for amino acids 
are used (e.g., P for proline). These abbreviations are 
included in Stryer, Biochemistry, Third Ed., 1988, which is
incorporated herein by reference for all purposes. 

6. Radiation: Energy which may be selectively applied 
including energy having a wavelength of between 10⁻¹⁴
and 10⁴ meters including, for example, electron beam
radiation, gamma radiation, x-ray radiation, ultra-violet 
radiation, visible light, infrared radiation, microwave 
radiation, and radio waves. "Irradiation" refers to the 
application of radiation to a surface. 

7. Receptor: A molecule that has an affinity for a given 
ligand. Receptors may be naturally-occurring or manmade
molecules. Also, they can be employed in their unaltered 
state or as aggregates with other species. Receptors may 
be attached, covalently or noncovalently, to a binding
member, either directly or via a specific binding sub-
stance. Examples of receptors which can be employed by
this invention include, but are not restricted to, antibodies, 
cell membrane receptors, monoclonal antibodies and anti- 
sera reactive with specific antigenic determinants (such as 
on viruses, cells or other materials), drugs, 
polynucleotides, nucleic acids, peptides, cofactors, 
lectins, sugars, polysaccharides, cells, cellular 
membranes, and organelles. Receptors are sometimes 
referred to in the art as anti-ligands. As the term receptors 
is used herein, no difference in meaning is intended. A 
"Ligand Receptor Pair" is formed when two macromol-
ecules have combined through molecular recognition to
form a complex. 

Other examples of receptors which can be investigated by 
this invention include but are not restricted to: 

a) Microorganism receptors: Determination of ligands 
which bind to receptors, such as specific transport 
proteins or enzymes essential to survival of
microorganisms, is useful in a new class of antibiotics. 
Of particular value would be antibiotics against oppor- 
tunistic fungi, protozoa, and those bacteria resistant to 
the antibiotics in current use. 

b) Enzymes: For instance, the binding site of enzymes 
such as the enzymes responsible for cleaving neu- 
rotransmitters; determination of ligands which bind to 
certain receptors to modulate the action of the enzymes 
which cleave the different neurotransmitters is useful in 
the development of drugs which can be used in the 
treatment of disorders of neurotransmission. 

c) Antibodies: For instance, the invention may be useful 
in investigating the ligand-binding site on the antibody 
molecule which combines with the epitope of an anti- 
gen of interest; determining a sequence that mimics an 
antigenic epitope may lead to the development of 
vaccines of which the immunogen is based on one or
more of such sequences or lead to the development of
related diagnostic agents or compounds useful in thera-
peutic treatments such as for auto-immune diseases
(e.g., by blocking the binding of the "self" antibodies).

d) Nucleic Acids: Sequences of nucleic acids may be 
synthesized to establish DNA or RNA binding 
sequences. 

e) Catalytic Polypeptides: Polymers, preferably 
polypeptides, which are capable of promoting a chemi-
cal reaction involving the conversion of one or more
reactants to one or more products. Such polypeptides 
generally include a binding site specific for at least one 
reactant or reaction intermediate and an active func-
tionality proximate to the binding site, which function-
ality is capable of chemically modifying the bound
reactant. Catalytic polypeptides are described in, for 
example, U.S. application Ser. No. 404,920, which is 
incorporated herein by reference for all purposes. 

f) Hormone receptors: For instance, the receptors for
insulin and growth hormone. Determination of the 
ligands which bind with high affinity to a receptor is 
useful in the development of, for example, an oral 
replacement of the daily injections which diabetics 
must take to relieve the symptoms of diabetes, and in
the other case, a replacement for the scarce human 
growth hormone which can only be obtained from 
cadavers or by recombinant DNA technology. Other 
examples are the vasoconstrictive hormone receptors; 
determination of those ligands which bind to a receptor
may lead to the development of drugs to control blood 
pressure. 

g) Opiate receptors: Determination of ligands which bind 
to the opiate receptors in the brain is useful in the 
development of less-addictive replacements for mor-
phine and related drugs.

8. Substrate: A material having a rigid or semi-rigid surface. 
In many embodiments, at least one surface of the substrate 
will be substantially flat, although in some embodiments
it may be desirable to physically separate synthesis
regions for different polymers with, for example, wells, 
raised regions, etched trenches, or the like. According to 
other embodiments, small beads may be provided on the 
surface which may be released upon completion of the 
synthesis.

9. Protective Group: A material which is bound to a mono- 
mer unit and which may be spatially removed upon 
selective exposure to an activator such as electromagnetic 
radiation. Examples of protective groups with utility 
herein include Nitroveratryloxy carbonyl, Nitrobenzyloxy
carbonyl, Dimethyl dimethoxybenzyloxy carbonyl,
5-Bromo-7-nitroindolinyl, o-Hydroxy-α-methyl
cinnamoyl, and 2-oxymethylene anthraquinone. Other
examples of activators include ion beams, electric fields,
magnetic fields, electron beams, x-ray, and the like.

10. Predefined Region: A predefined region is a localized 
area on a surface which is, was, or is intended to be 
activated for formation of a polymer. The predefined 
region may have any convenient shape, e.g., circular, 
rectangular, elliptical, wedge-shaped, etc. For the sake of
brevity herein, "predefined regions" are sometimes 
referred to simply as "regions." 

11. Substantially Pure: A polymer is considered to be "sub- 
stantially pure" within a predefined region of a substrate 
when it exhibits characteristics that distinguish it from
other predefined regions. Typically, purity will be mea-
sured in terms of biological activity or function as a result
of uniform sequence. Such characteristics will typically
be measured by way of binding with a selected ligand or 
receptor. 

II. General 

The present invention provides methods and apparatus for 
the preparation and use of a substrate having a plurality of 
polymer sequences in predefined regions. The invention is 
described herein primarily with regard to the preparation of 
molecules containing sequences of amino acids, but could 
readily be applied in the preparation of other polymers. Such 
polymers include, for example, both linear and cyclic poly- 
mers of nucleic acids, polysaccharides, phospholipids, and 
peptides having either α-, β-, or ω-amino acids, hetero-
polymers in which a known drug is covalently bound to any
of the above, polyurethanes, polyesters, polycarbonates, 
polyureas, polyamides, polyethyleneimines, polyarylene 
sulfides, polysiloxanes, polyimides, polyacetates, or other 
polymers which will be apparent upon review of this dis- 
closure. In a preferred embodiment, the invention herein is 
used in the synthesis of peptides. 

The prepared substrate may, for example, be used in 
screening a variety of polymers as ligands for binding with 
a receptor, although it will be apparent that the invention 
could be used for the synthesis of a receptor for binding with 
a ligand. The substrate disclosed herein will have a wide 
variety of other uses. Merely by way of example, the 
invention herein can be used in determining peptide and 
nucleic acid sequences which bind to proteins, finding 
sequence-specific binding drugs, identifying epitopes rec- 
ognized by antibodies, and evaluation of a variety of drugs 
for clinical and diagnostic applications, as well as combi- 
nations of the above. 

The invention preferably provides for the use of a sub- 
strate "S" with a surface. Linker molecules "L" are option- 
ally provided on a surface of the substrate. The purpose of 
the linker molecules, in some embodiments, is to facilitate 
receptor recognition of the synthesized polymers. 

Optionally, the linker molecules may be chemically pro- 
tected for storage purposes. A chemical storage protective 
group such as t-BOC (t-butoxycarbonyl) may be used in 
some embodiments. Such chemical protective groups would 
be chemically removed upon exposure to, for example, 
acidic solution and would serve to protect the surface during 
storage and be removed prior to polymer preparation. 

On the substrate or a distal end of the linker molecules, a 
functional group with a protective group P₀ is provided. The
protective group P₀ may be removed upon exposure to
radiation, electric fields, electric currents, or other activators 
to expose the functional group. 

In a preferred embodiment, the radiation is ultraviolet 
(UV), infrared (IR), or visible light. As more fully described 
below, the protective group may alternatively be an 
electrochemically-sensitive group which may be removed in 
the presence of an electric field. In still further alternative 
embodiments, ion beams, electron beams, or the like may be 
used for deprotection. 

In some embodiments, the exposed regions and, therefore, 
the area upon which each distinct polymer sequence is 
synthesized are smaller than about 1 cm² or less than 1 mm².
In preferred embodiments the exposed area is less than about
10,000 μm² or, more preferably, less than 100 μm² and may,
in some embodiments, encompass the binding site for as few 
as a single molecule. Within these regions, each polymer is 
preferably synthesized in a substantially pure form. 
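
To put the region sizes above in perspective, the following sketch converts a region area into an upper bound on the number of distinct synthesis regions per square centimetre. It assumes, purely for illustration, that regions tile the surface with no gaps; the function name is hypothetical.

```python
# 1 cm = 10,000 micrometres, so 1 cm^2 = 10^8 square micrometres.
UM2_PER_CM2 = 10_000 ** 2


def max_regions_per_cm2(region_area_um2: int) -> int:
    """Upper bound on regions per cm^2, assuming gap-free tiling."""
    return UM2_PER_CM2 // region_area_um2


print(max_regions_per_cm2(10_000))  # 10000 regions at 10,000 square micrometres each
print(max_regions_per_cm2(100))     # 1000000 regions at 100 square micrometres each
```

Any spacing left between regions would lower these figures.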

Concurrently with or after exposure of a known region of the
substrate to light, the surface is contacted with a first
monomer unit M1 which reacts with the functional group
which has been exposed by the deprotection step. The first
monomer includes a protective group P1. P1 may or may not
be the same as P0.

Accordingly, after a first cycle, known first regions of the
surface may comprise the sequence:

S-L-M1-P1

while remaining regions of the surface comprise the
sequence:

S-L-P0.

Thereafter, second regions of the surface (which may
include the first region) are exposed to light and contacted
with a second monomer M2 (which may or may not be the
same as M1) having a protective group P2. P2 may or may
not be the same as P0 and P1. After this second cycle,
different regions of the substrate may comprise one or more
of the following sequences:

S-L-M1-M2-P2
S-L-M2-P2
S-L-M1-P1 and/or
S-L-P0.

The above process is repeated until the substrate includes 
desired polymers of desired lengths. By controlling the 
locations of the substrate exposed to light and the reagents 
exposed to the substrate following exposure, the location of
each sequence will be known. 
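
By way of illustration only, the bookkeeping implied by the cycles above may be sketched in Python. The function names `synthesize` and `label`, the region indices, and the convention that the terminal protective group carries the index of the last monomer added are assumptions made for this sketch and are not part of the disclosure.

```python
def synthesize(n_regions, cycles):
    """Return the monomer chain built at each region.

    cycles is a list of (exposed_regions, monomer) pairs. In each cycle
    the exposed regions are deprotected and the monomer couples there.
    """
    chains = [[] for _ in range(n_regions)]
    for exposed_regions, monomer in cycles:
        for region in exposed_regions:
            chains[region].append(monomer)
    return chains


def label(chain):
    """Write a chain in the S-L-M...-P notation used in the text."""
    # Assumed convention: monomer "M2" carries protective group "P2";
    # an unreacted region still carries the original group P0.
    cap = "P" + chain[-1][1:] if chain else "P0"
    return "-".join(["S", "L", *chain, cap])


# Cycle 1 exposes regions 0 and 1 to M1; cycle 2 exposes regions 1 and 2 to M2.
cycles = [({0, 1}, "M1"), ({1, 2}, "M2")]
layout = [label(chain) for chain in synthesize(4, cycles)]
```

Because the exposure pattern and the reagent of every cycle are recorded, the sequence at each region follows directly from the cycle list, which is the sense in which the location of each sequence is known.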

Thereafter, the protective groups are removed from some 
or all of the substrate and the sequences are, optionally, 
capped with a capping unit C. The process results in a 
substrate having a surface with a plurality of polymers of the 
following general formula: 

S-[L]-(Mi)-(Mj)-(Mk) . . . (Mx)-[C]

where square brackets indicate optional groups, and Mi . . .
Mx indicates any sequence of monomers. The number of
monomers could cover a wide variety of values, but in a 
preferred embodiment they will range from 2 to 100. 

In some embodiments, polymers at a plurality of locations
on the substrate are to contain a common monomer
subsequence. For example, it may be desired to synthesize
a sequence S-M1-M2-M3 at first locations and a sequence
S-M4-M2-M3 at second locations. The process would
commence with irradiation of the first locations followed by
contacting with M1-P, resulting in the sequence S-M1-P at
the first locations. The second locations would then be
irradiated and contacted with M4-P, resulting in the sequence
S-M4-P at the second locations. Thereafter both the first and
second locations would be irradiated and contacted with the
dimer M2-M3, resulting in the sequence S-M1-M2-M3 at the
first locations and S-M4-M2-M3 at the second locations. Of
course, common subsequences of any length could be
utilized including those in a range of 2 or more monomers, 2
to 100 monomers, 2 to 20 monomers, and a most preferred
range of 2 to 3 monomers.

According to other embodiments, a set of masks is used 
for the first monomer layer and, thereafter, varied light 
wavelengths are used for selective deprotection. For 
example, in the process discussed above, first regions are
first exposed through a mask and reacted with a first
monomer having a first protective group P1, which is removable
upon exposure to a first wavelength of light (e.g., IR).
Second regions are masked and reacted with a second
monomer having a second protective group P2, which is
removable upon exposure to a second wavelength of light
(e.g., UV). Thereafter, masks become unnecessary in the
synthesis because the entire substrate may be exposed
alternately to the first and second wavelengths of light in
the deprotection cycle.
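
The maskless stage of this scheme may be sketched, purely as an illustration, as follows. The pairing of P1 with IR and P2 with UV is taken from the example above; the dictionary layout, the function name `flood_couple`, and the monomer letters are hypothetical.

```python
# Wavelength at which each protective group is removed (from the example above).
LABILE_AT = {"P1": "IR", "P2": "UV"}


def flood_couple(regions, wavelength, monomer, new_group):
    """Expose the whole substrate at one wavelength, then couple a monomer.

    Only regions whose terminal group is labile at that wavelength are
    deprotected, so only those regions are extended. No mask is involved.
    """
    for region in regions:
        if LABILE_AT[region["group"]] == wavelength:
            region["chain"].append(monomer)
            region["group"] = new_group


# State after the masked first layer: one region per protective group.
regions = [
    {"chain": ["A"], "group": "P1"},
    {"chain": ["B"], "group": "P2"},
]
flood_couple(regions, "IR", "C", "P1")  # extends only the P1 region
flood_couple(regions, "UV", "D", "P2")  # extends only the P2 region
```

In this sketch each added monomer carries the same group as the one it replaced, so a region stays addressable by the same wavelength in later cycles.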

The polymers prepared on a substrate according to the
above methods will have a variety of uses including, for
example, screening for biological activity. In such screening
activities, the substrate containing the sequences is exposed
to an unlabeled or labeled receptor such as an antibody,
receptor on a cell, phospholipid vesicle, or any one of a
variety of other receptors. In one preferred embodiment the
polymers are exposed to a first, unlabeled receptor of interest
and, thereafter, exposed to a labeled receptor-specific
recognition element, which is, for example, an antibody. This
process will provide signal amplification in the detection
stage.

The receptor molecules may bind with one or more
polymers on the substrate. The presence of the labeled
receptor and, therefore, the presence of a sequence which
binds with the receptor is detected in a preferred
embodiment through the use of autoradiography, detection of
fluorescence with a charge-coupled device, fluorescence
microscopy, or the like. The sequence of the polymer at the
locations where the receptor binding is detected may be used
to determine all or part of a sequence which is
complementary to the receptor.

Use of the invention herein is illustrated primarily with
reference to screening for biological activity. The invention
will, however, find many other uses. For example, the
invention may be used in information storage (e.g., on
optical disks), production of molecular electronic devices,
production of stationary phases in separation sciences,
production of dyes and brightening agents, photography, and in
immobilization of cells, proteins, lectins, nucleic acids,
polysaccharides and the like in patterns on a surface via
molecular recognition of specific polymer sequences. By
synthesizing the same compound in adjacent, progressively
differing concentrations, a gradient will be established to
control chemotaxis or to develop diagnostic dipsticks which,
for example, titrate an antibody against an increasing
amount of antigen. By synthesizing several catalyst
molecules in close proximity, more efficient multistep
conversions may be achieved by "coordinate immobilization."
Coordinate immobilization also may be used for electron
transfer systems, as well as to provide both structural
integrity and other desirable properties to materials such as
lubrication, wetting, etc.

According to alternative embodiments, molecular
biodistribution or pharmacokinetic properties may be examined.
For example, to assess resistance to intestinal or serum
proteases, polymers may be capped with a fluorescent tag
and exposed to biological fluids of interest.

III. Polymer Synthesis 

FIG. 1 illustrates one embodiment of the invention
disclosed herein in which a substrate 2 is shown in
cross-section. Essentially, any conceivable substrate may be
employed in the invention. The substrate may be biological,
nonbiological, organic, inorganic, or a combination of any of
these, existing as particles, strands, precipitates, gels, sheets,
tubing, spheres, containers, capillaries, pads, slices, films,
plates, slides, etc. The substrate may have any convenient
shape, such as a disc, square, sphere, circle, etc. The
substrate is preferably flat but may take on a variety of
alternative surface configurations. For example, the
substrate may contain raised or depressed regions on which the
synthesis takes place. The substrate and its surface
preferably form a rigid support on which to carry out the reactions
described herein. The substrate and its surface are also chosen
to provide appropriate light-absorbing characteristics. For
instance, the substrate may be a polymerized Langmuir
Blodgett film, functionalized glass, Si, Ge, GaAs, GaP, SiO2,
SiN4, modified silicon, or any one of a wide variety of gels
or polymers such as (poly)tetrafluoroethylene,
(poly)vinylidenedifluoride, polystyrene, polycarbonate, or
combinations thereof. Other substrate materials will be readily
apparent to those of skill in the art upon review of this
disclosure. In a preferred embodiment the substrate is flat
glass or single-crystal silicon with surface relief features of
less than 10 Å.

According to some embodiments, the surface of the
substrate is etched using well known techniques to provide
for desired surface features. For example, by way of the
formation of trenches, v-grooves, mesa structures, or the
like, the synthesis regions may be more closely placed
within the focus point of impinging light, be provided with
reflective "mirror" structures for maximization of light
collection from fluorescent sources, or the like.

Surfaces on the solid substrate will usually, though not 
always, be composed of the same material as the substrate. 
Thus, the surface may be composed of any of a wide variety 
of materials, for example, polymers, plastics, resins, 
polysaccharides, silica or silica-based materials, carbon, 
metals, inorganic glasses, membranes, or any of the above- 
listed substrate materials. In some embodiments the surface 
may provide for the use of caged binding members which 
are attached firmly to the surface of the substrate in accord
with the teaching of copending application Ser. No. 404,920,
previously incorporated herein by reference. Preferably, the
surface will contain reactive groups, which could be
carboxyl, amino, hydroxyl, or the like. Most preferably, the
surface will be optically transparent and will have surface
Si-OH functionalities, such as are found on silica surfaces.

The surface 4 of the substrate is preferably provided with
a layer of linker molecules 6, although it will be understood
that the linker molecules are not required elements of the
invention. The linker molecules are preferably of sufficient
length to permit polymers in a completed substrate to
interact freely with molecules exposed to the substrate. The
linker molecules should be 6-50 atoms long to provide
sufficient exposure. The linker molecules may be, for
example, aryl acetylene, ethylene glycol oligomers
containing 2-10 monomer units, diamines, diacids, amino acids, or
combinations thereof. Other linker molecules may be used
in light of this disclosure.

According to alternative embodiments, the linker mol- 
ecules are selected based upon their hydrophilic/ 
hydrophobic properties to improve presentation of synthe- 
sized polymers to certain receptors. For example, in the case 
of a hydrophilic receptor, hydrophilic linker molecules will
be preferred so as to permit the receptor to more closely 
approach the synthesized polymer. 

According to another alternative embodiment, linker mol- 
ecules are also provided with a photocleavable group at an 
intermediate position. The photocleavable group is
preferably cleavable at a wavelength different from the protective
group. This enables removal of the various polymers fol- 
lowing completion of the synthesis by way of exposure to 
the different wavelengths of light. 

The linker molecules can be attached to the substrate via
carbon-carbon bonds using, for example,
(poly)trifluorochloroethylene surfaces, or preferably, by siloxane
bonds (using, for example, glass or silicon oxide surfaces).
Siloxane bonds with the surface of the substrate may be 
formed in one embodiment via reactions of linker molecules 
bearing trichlorosilyl groups. The linker molecules may 
optionally be attached in an ordered array, i.e., as parts of the 
head groups in a polymerized Langmuir Blodgett film. In 
alternative embodiments, the linker molecules are adsorbed 
to the surface of the substrate. 

The linker molecules and monomers used herein are 
provided with a functional group to which is bound a
protective group. Preferably, the protective group is on the 
distal or terminal end of the linker molecule opposite the 
substrate. The protective group may be either a negative 
protective group (i.e., the protective group renders the linker 
molecules less reactive with a monomer upon exposure) or 
a positive protective group (i.e., the protective group renders 
the linker molecules more reactive with a monomer upon 
exposure). In the case of negative protective groups an 
additional step of reactivation will be required. In some 
embodiments, this will be done by heating. 

The protective group on the linker molecules may be 
selected from a wide variety of positive light-reactive groups 
preferably including nitro aromatic compounds such as 
o-nitrobenzyl derivatives or benzylsulfonyl. In a preferred 
embodiment, 6-nitroveratryloxycarbonyl (NVOC),
2-nitrobenzyloxycarbonyl (NBOC) or
α,α-dimethyl-dimethoxybenzyloxycarbonyl (DDZ) is used. In one
embodiment, a nitro aromatic compound containing a ben- 
zylic hydrogen ortho to the nitro group is used, i.e., a 
chemical of the form: 




where R1 is alkoxy, alkyl, halo, aryl, alkenyl, or hydrogen;
R2 is alkoxy, alkyl, halo, aryl, nitro, or hydrogen; R3 is
alkoxy, alkyl, halo, nitro, aryl, or hydrogen; R4 is alkoxy,
alkyl, hydrogen, aryl, halo, or nitro; and R5 is alkyl, alkynyl,
cyano, alkoxy, hydrogen, halo, aryl, or alkenyl. Other
materials which may be used include o-hydroxy-α-methyl
cinnamoyl derivatives. Photoremovable protective groups are
described in, for example, Patchornik, J. Am. Chem. Soc. 
(1970) 92:6333 and Amit et al., J. Org. Chem. (1974) 
39:192, both of which are incorporated herein by reference. 

In an alternative embodiment the positive reactive group 
is activated for reaction with reagents in solution. For 
example, a 5-bromo-7-nitro indoline group, when bound to 
a carbonyl, undergoes reaction upon exposure to light at 420
nm.

In a second alternative embodiment, the reactive group on 
the linker molecule is selected from a wide variety of 
negative light-reactive groups including a cinnamate group.

Alternatively, the reactive group is activated or deacti- 
vated by electron beam lithography, x-ray lithography, or 
any other radiation. Suitable reactive groups for electron 
beam lithography include sulfonyl. Other methods may be 
used including, for example, exposure to a current source. 
Other reactive groups and methods of activation may be 
used in light of this disclosure. 

As shown in FIG. 1, the linking molecules are preferably 
exposed to, for example, light through a suitable mask 8 
using photolithographic techniques of the type known in the 
semiconductor industry and described in, for example, Sze, 
VLSI Technology, McGraw-Hill (1983), and Mead et al.,
Introduction to VLSI Systems, Addison-Wesley (1980), 
which are incorporated herein by reference for all purposes. 
The light may be directed at either the surface containing the 
protective groups or at the back of the substrate, so long as 
the substrate is transparent to the wavelength of light needed 
for removal of the protective groups. In the embodiment 
shown in FIG. 1, light is directed at the surface of the 
substrate containing the protective groups. FIG. 1 illustrates 
the use of such masking techniques as they are applied to a 
positive reactive group so as to activate linking molecules 
and expose functional groups in areas 10a and 10b.

The mask 8 is in one embodiment a transparent support 
material selectively coated with a layer of opaque material. 
Portions of the opaque material are removed, leaving opaque 
material in the precise pattern desired on the substrate 
surface. The mask is brought into close proximity with, 
imaged on, or brought directly into contact with the substrate 
surface as shown in FIG. 1. "Openings" in the mask corre- 
spond to locations on the substrate where it is desired to 
remove photoremovable protective groups from the sub- 
strate. Alignment may be performed using conventional 
alignment techniques in which alignment marks (not shown) 
are used to accurately overlay successive masks with pre- 
vious patterning steps, or more sophisticated techniques may 
be used. For example, interferometric techniques such as the 
one described in Flanders et al., "A New Interferometric 
Alignment Technique," App. Phys. Lett. (1977) 31:426-428, 
which is incorporated herein by reference, may be used. 

To enhance contrast of light applied to the substrate, it is 
desirable to provide contrast enhancement materials 
between the mask and the substrate according to some 
embodiments. This contrast enhancement layer may com- 
prise a molecule which is decomposed by light such as 
quinone diazide or a material which is transiently bleached at
the wavelength of interest. Transient bleaching of materials 
will allow greater penetration where light is applied, thereby 
enhancing contrast. Alternatively, contrast enhancement 
may be provided by way of a cladded fiber optic bundle. 

The light may be from a conventional incandescent 
source, a laser, a laser diode, or the like. If non-collimated 
sources of light are used it may be desirable to provide a 
thick- or multi-layered mask to prevent spreading of the 
light onto the substrate. It may, further, be desirable in some 
embodiments to utilize groups which are sensitive to differ- 
ent wavelengths to control synthesis. For example, by using 
groups which are sensitive to different wavelengths, it is 
possible to select branch positions in the synthesis of a 
polymer or eliminate certain masking steps. Several reactive 
groups along with their corresponding wavelengths for 
deprotection are provided in Table 1. 

TABLE 1

Group                                    Approximate
                                         Deprotection Wavelength

Nitroveratryloxy carbonyl (NVOC)         UV (300-400 nm)
Nitrobenzyloxy carbonyl (NBOC)           UV (300-350 nm)
Dimethyl dimethoxybenzyloxy carbonyl     UV (280-300 nm)
5-Bromo-7-nitroindolinyl                 UV (420 nm)
o-Hydroxy-α-methyl cinnamoyl             UV (300-350 nm)
2-Oxymethylene anthraquinone             UV (350 nm)

While the invention is illustrated primarily herein by way
of the use of a mask to illuminate selected regions of the
substrate, other techniques may also be used. For example,
the substrate may be translated under a modulated laser or
diode light source. Such techniques are discussed in, for
example, U.S. Pat. No. 4,719,615 (Feyrer et al.), which is
incorporated herein by reference. In alternative
embodiments a laser galvanometric scanner is utilized. In other
embodiments, the synthesis may take place on or in contact
with a conventional liquid crystal (referred to herein as a
"light valve") or fiber optic light sources. By appropriately
modulating liquid crystals, light may be selectively
controlled so as to permit light to contact selected regions of the
substrate. Alternatively, synthesis may take place on the end
of a series of optical fibers to which light is selectively
applied. Other means of controlling the location of light
exposure will be apparent to those of skill in the art.

The substrate may be irradiated either in contact or not in
contact with a solution (not shown) and is, preferably,
irradiated in contact with a solution. The solution contains
reagents to prevent the by-products formed by irradiation
from interfering with synthesis of the polymer according to
some embodiments. Such by-products might include, for
example, carbon dioxide, nitrosocarbonyl compounds,
styrene derivatives, indole derivatives, and products of their
photochemical reactions. Alternatively, the solution may
contain reagents used to match the index of refraction of the
substrate. Reagents added to the solution may further
include, for example, acidic or basic buffers, thiols,
substituted hydrazines and hydroxylamines, reducing agents (e.g.,
NADH) or reagents known to react with a given functional
group (e.g., aryl nitroso + glyoxylic acid → aryl
formhydroxamate + CO2).

Either concurrently with or after the irradiation step, the
linker molecules are washed or otherwise contacted with a
first monomer, illustrated by "A" in regions 12a and 12b in
FIG. 2. The first monomer reacts with the activated
functional groups of the linkage molecules which have been
exposed to light. The first monomer, which is preferably an
amino acid, is also provided with a photoprotective group.
The photoprotective group on the monomer may be the same
as or different than the protective group used in the linkage
molecules, and may be selected from any of the
above-described protective groups. In one embodiment, the
protective group for the A monomer is selected from the group
NBOC and NVOC.

As shown in FIG. 3, the process of irradiating is thereafter
repeated, with a mask repositioned so as to remove linkage
protective groups and expose functional groups in regions
14a and 14b which are illustrated as being regions which
were protected in the previous masking step. As an
alternative to repositioning of the first mask, in many embodiments
a second mask will be utilized. In other alternative
embodiments, some steps may provide for illuminating a
common region in successive steps. As shown in FIG. 3, it
may be desirable to provide separation between irradiated
regions. For example, separation of about 1-5 µm may be
appropriate to account for alignment tolerances.

As shown in FIG. 4, the substrate is then exposed to a
second protected monomer "B," producing B regions 16a
and 16b. Thereafter, the substrate is again masked so as to
remove the protective groups and expose reactive groups on
A region 12a and B region 16b. The substrate is again
exposed to monomer B, resulting in the formation of the
structure shown in FIG. 6. The dimers B-A and B-B have
been produced on the substrate.

A subsequent series of masking and contacting steps 
similar to those described above with A (not shown) pro- 
vides the structure shown in FIG. 7. The process provides all 
possible dimers of B and A, i.e., B-A, A-B, A-A, and B-B. 
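
The masking sequence of FIGS. 2 through 7 may be generalized, for illustration only, to a simple planning routine: for each position in the polymer and each monomer, the mask opens exactly those regions whose target sequence has that monomer at that position. The routine below is a sketch under that assumption and does not reproduce the particular mask order shown in the figures. The names `plan_masks` and `apply_steps` are hypothetical, and each string lists monomers in order of attachment.

```python
from itertools import product


def plan_masks(monomers, length):
    """Plan one (mask, monomer) step per position and monomer.

    Returns the target sequence assigned to each region and the list of
    steps. A mask is the set of region indices exposed in that step.
    """
    targets = ["".join(p) for p in product(monomers, repeat=length)]
    steps = []
    for position in range(length):
        for monomer in monomers:
            mask = {i for i, t in enumerate(targets) if t[position] == monomer}
            steps.append((mask, monomer))
    return targets, steps


def apply_steps(n_regions, steps):
    """Carry out the planned steps and return what was built at each region."""
    built = [""] * n_regions
    for mask, monomer in steps:
        for i in mask:
            built[i] += monomer
    return built


targets, steps = plan_masks("AB", 2)
built = apply_steps(len(targets), steps)
```

For two monomers and dimers this plan uses four masking and coupling steps to populate four regions, one for each possible dimer. In general the number of steps grows as the polymer length times the number of monomers, while the number of distinct sequences grows as the number of monomers raised to the polymer length.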

The substrate, the area of synthesis, and the area for 
synthesis of each individual polymer could be of any size or 
shape. For example, squares, ellipsoids, rectangles, 
triangles, circles, or portions thereof, along with irregular 
geometric shapes, may be utilized. Duplicate synthesis areas
may also be applied to a single substrate for purposes of
redundancy.

In one embodiment the regions 12 and 16 on the substrate
will have a surface area of between about 1 cm² and 10⁻¹⁰
cm². In some embodiments the regions 12 and 16 have areas
of less than about 10⁻¹ cm², 10⁻² cm², 10⁻³ cm², 10⁻⁴ cm²,
10⁻⁵ cm², 10⁻⁶ cm², 10⁻⁷ cm², 10⁻⁸ cm², or 10⁻¹⁰ cm². In
a preferred embodiment, the regions 12 and 16 are between
about 10×10 µm and 500×500 µm.
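
The relationship between region size and the number of synthesis regions may be illustrated with a short calculation. The 1 cm square synthesis area, the square grid layout, and the function name `regions_per_substrate` are assumptions chosen for this sketch. The optional gap corresponds to the separation between irradiated regions mentioned in connection with FIG. 3.

```python
def regions_per_substrate(substrate_side_um, region_side_um, gap_um=0):
    """Count square regions that fit on a square substrate.

    n regions in a row need n * region + (n - 1) * gap micrometers,
    so n is the largest integer with n <= (side + gap) / (region + gap).
    """
    per_side = (substrate_side_um + gap_um) // (region_side_um + gap_um)
    return per_side * per_side


SIDE_UM = 10_000  # hypothetical 1 cm x 1 cm synthesis area

small = regions_per_substrate(SIDE_UM, 10)             # 10 x 10 um regions
large = regions_per_substrate(SIDE_UM, 500)            # 500 x 500 um regions
spaced = regions_per_substrate(SIDE_UM, 10, gap_um=5)  # 5 um separation
```

Under these assumptions the preferred range of region sizes spans from 400 regions per square centimeter at 500×500 µm to one million at 10×10 µm with no separation between regions.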

In some embodiments a single substrate supports more
than about 10 different monomer sequences and preferably
more than about 100 different monomer sequences, although
in some embodiments more than about 10³, 10⁴, 10⁵, 10⁶,
10⁷, or 10⁸ different sequences are provided on a substrate.
Of course, within a region of the substrate in which a
monomer sequence is synthesized, it is preferred that the
monomer sequence be substantially pure. In some
embodiments, regions of the substrate contain polymer
sequences which are at least about 1%, 5%, 10%, 15%, 20%,
25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%, 80%, 90%,
95%, 96%, 97%, 98%, or 99% pure.

According to some embodiments, several sequences are
intentionally provided within a single region so as to provide
an initial screening for biological activity, after which
materials within regions exhibiting significant binding are further
evaluated.

IV. Details of One Embodiment of a Reactor System

FIG. 8A schematically illustrates a preferred embodiment
of a reactor system 100 for synthesizing polymers on the
prepared substrate in accordance with one aspect of the
invention. The reactor system includes a body 102 with a
cavity 104 on a surface thereof. In preferred embodiments
the cavity 104 is between about 50 and 1000 µm deep with
a depth of about 500 µm preferred.

The bottom of the cavity is preferably provided with an 
array of ridges 106 which extend both into the plane of the 
Figure and parallel to the plane of the Figure. The ridges are 
preferably about 50 to 200 µm deep and spaced at about 2
to 3 mm. The purpose of the ridges is to generate turbulent
flow for better mixing. The bottom surface of the cavity is 
preferably light absorbing so as to prevent reflection of 
impinging light. 

A substrate 112 is mounted above the cavity 104. The
substrate is provided along its bottom surface 114 with a
photoremovable protective group such as NVOC with or
without an intervening linker molecule. The substrate is
preferably transparent to a wide spectrum of light, but in
some embodiments is transparent only at a wavelength at
which the protective group may be removed (such as UV in
the case of NVOC). The substrate in some embodiments is
a conventional microscope glass slide or cover slip. The
substrate is preferably as thin as possible, while still
providing adequate physical support. Preferably, the substrate is
less than about 1 mm thick, more preferably less than 0.5
mm thick, more preferably less than 0.1 mm thick, and most
preferably less than 0.05 mm thick. In alternative preferred
embodiments, the substrate is quartz or silicon.

The substrate and the body serve to seal the cavity except 
for an inlet port 108 and an outlet port 110. The body and the 
substrate may be mated for sealing in some embodiments 
with one or more gaskets. According to a preferred 
embodiment, the body is provided with two concentric 
gaskets and the intervening space is held at vacuum to 
ensure mating of the substrate to the gaskets.

Fluid is pumped through the inlet port into the cavity by 
way of a pump 116 which may be, for example, a model no. 
B-120-S made by Eldex Laboratories. Selected fluids are 
circulated into the cavity by the pump, through the cavity, 
and out the outlet for recirculation or disposal. The reactor
may be subjected to ultrasonic radiation and/or heated to aid 
in agitation in some embodiments. 

Above the substrate 112, a lens 120 is provided which
may be, for example, a 2" 100 mm focal length fused silica
lens. For the sake of a compact system, a reflective mirror
122 may be provided for directing light from a light source
124 onto the substrate. Light source 124 may be, for
example, a Xe(Hg) light source manufactured by Oriel and
having model no. 66024. A second lens 126 may be provided
for the purpose of projecting a mask image onto the substrate
in combination with lens 120. This form of lithography is
referred to herein as projection printing. As will be apparent
from this disclosure, proximity printing and the like may
also be used according to some embodiments.

Light from the light source is permitted to reach only 
selected locations on the substrate as a result of mask 128. 
Mask 128 may be, for example, a glass slide having etched 
chrome thereon. The mask 128 in one embodiment is 
provided with a grid of transparent locations and opaque 
locations. Such masks may be manufactured by, for 
example, Photo Sciences, Inc. Light passes freely through 
the transparent regions of the mask, but is reflected from or 
absorbed by other regions. Therefore, only selected regions 
of the substrate are exposed to light. 

As discussed above, light valves (LCD's) may be used as 
an alternative to conventional masks to selectively expose 
regions of the substrate. Alternatively, fiber optic faceplates 
such as those available from Schott Glass, Inc., may be used
for the purpose of contrast enhancement of the mask or as
the sole means of restricting the region to which light is
applied. Such faceplates would be placed directly above or
on the substrate in the reactor shown in FIG. 8A. In still
further embodiments, fly's-eye lenses, tapered fiber optic
faceplates, or the like, may be used for contrast
enhancement.

In order to provide for illumination of regions smaller 
than a wavelength of light, more elaborate techniques may 
be utilized. For example, according to one preferred 
embodiment, light is directed at the substrate by way of 
molecular microcrystals on the tip of, for example, micropi- 
pettes. Such devices are disclosed in Lieberman et al., "A 
Light Source Smaller Than the Optical Wavelength," Sci- 
ence (1990) 247:59-61, which is incorporated herein by 
reference for all purposes. 

In operation, the substrate is placed on the cavity and 
sealed thereto. All operations in the process of preparing the 
substrate are carried out in a room lit primarily or entirely by 
light of a wavelength outside of the light range at which the 
protective group is removed. For example, in the case of 
NVOC, the room should be lit with a conventional dark 
room light which provides little or no UV light. All opera- 
tions are preferably conducted at about room temperature. 

A first, deprotection fluid (without a monomer) is
circulated through the cavity. The solution is preferably 5 mM
sulfuric acid in dioxane, which serves to keep
exposed amino groups protonated and decreases their
reactivity with photolysis by-products. Absorptive materials
such as N,N-diethylamino 2,4-dinitrobenzene, for example,
may be included in the deprotection fluid to
absorb light and prevent reflection and unwanted photolysis.

The slide is, thereafter, positioned in a light raypath from
the mask such that first locations on the substrate are
illuminated and, therefore, deprotected. In preferred
embodiments the substrate is illuminated for between about
1 and 15 minutes with a preferred illumination time of about
10 minutes at 10-20 mW/cm² with 365 nm light. The slides
are neutralized (i.e., brought to a pH of about 7) after
photolysis with, for example, a solution of
di-isopropylethylamine (DIEA) in methylene chloride for
about 5 minutes.
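
The illumination conditions above may be restated as an exposure dose, i.e., irradiance multiplied by time. The short calculation below is illustrative only; the function name `dose_j_per_cm2` is hypothetical, and the dose figure is derived from the stated time and irradiance and is not itself stated in the disclosure.

```python
def dose_j_per_cm2(irradiance_mw_per_cm2, minutes):
    """Exposure dose in J/cm^2 for a constant irradiance in mW/cm^2."""
    seconds = minutes * 60
    # mW * s = mJ; divide by 1000 to convert mJ to J.
    return irradiance_mw_per_cm2 * seconds / 1000


# Preferred conditions: about 10 minutes at 10-20 mW/cm^2 (365 nm light).
low = dose_j_per_cm2(10, 10)
high = dose_j_per_cm2(20, 10)
```

On these figures the preferred 10 minute exposure corresponds to roughly 6 to 12 J/cm² delivered to the substrate.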

The first monomer is then placed at the first locations on
the substrate. After irradiation, the slide is removed, treated
in bulk, and then reinstalled in the flow cell. Alternatively,
a fluid containing the first monomer, preferably also
protected by a protective group, is circulated through the cavity
by way of pump 116. If, for example, it is desired to attach
the amino acid Y to the substrate at the first locations, the
amino acid Y (bearing a protective group on its α-nitrogen),
along with reagents used to render the monomer reactive,
and/or a carrier, is circulated from a storage container 118,
through the pump, through the cavity, and back to the inlet
of the pump.

The monomer carrier solution is, in a preferred 
embodiment, formed by mixing of a first solution (referred 
to herein as solution "A") and a second solution (referred to 
herein as solution "B"). Table 2 provides an illustration of a 
mixture which may be used for solution A. 

TABLE 2 

Representative Monomer Carrier Solution "A" 

100 mg NVOC amino protected amino acid
37 mg HOBT (1-Hydroxybenzotriazole)
250 µl DMF (Dimethylformamide)
86 µl DIEA (Diisopropylethylamine)



The composition of solution B is illustrated in Table 3. 
Solutions A and B are mixed and allowed to react at room 
temperature for about 8 minutes, then diluted with 2 ml of 
DMF, and 500 µl are applied to the surface of the slide or the
solution is circulated through the reactor system and allowed 
to react for about 2 hours at room temperature. The slide is 
then washed with DMF, methylene chloride and ethanol. 

TABLE 3 

Representative Monomer Carrier Solution "B"

250 µl DMF
111 mg BOP (Benzotriazolyl-n-oxy-tris(dimethylamino)
phosphonium hexafluorophosphate)
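
The quantities in Tables 2 and 3, together with the 2 ml DMF dilution and the 500 µl aliquot described above, allow a rough estimate of how much protected amino acid reaches the slide when an aliquot is applied. The calculation below is a sketch only. It assumes that liquid volumes are additive and that the dissolved solids add no volume, neither of which is stated in the disclosure, and the variable names are hypothetical.

```python
# Liquid components in microliters, taken from Tables 2 and 3 and the text.
liquid_ul = {
    "DMF in solution A": 250,
    "DIEA in solution A": 86,
    "DMF in solution B": 250,
    "DMF diluent": 2000,
}

# Assumption: volumes are additive and solids contribute no volume.
total_ul = sum(liquid_ul.values())

aliquot_ul = 500
aliquot_fraction = aliquot_ul / total_ul

# 100 mg of NVOC-protected amino acid is present in the whole mixture.
amino_acid_in_aliquot_mg = 100 * aliquot_fraction
```

Under these assumptions the mixture totals 2586 µl, so a 500 µl aliquot carries a little under one fifth of the amino acid, or about 19 mg.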



As the solution containing the monomer to be attached is 
circulated through the cavity, the amino acid or other mono- 
mer will react at its carboxy terminus with amino groups on 
the regions of the substrate which have been deprotected. Of 
course, while the invention is illustrated by way of circula- 
tion of the monomer through the cavity, the invention could 
be practiced by way of removing the slide from the reactor 
and submersing it in an appropriate monomer solution. 

After addition of the first monomer, the solution contain- 
ing the first amino acid is then purged from the system. After 
circulation of a sufficient amount of the DMF/methylene 
chloride such that removal of the amino acid can be assured 
(e.g., about 50 times the volume of the cavity and carrier
lines), the mask or substrate is repositioned, or a new mask 
is utilized such that second regions on the substrate will be 
exposed to light and the light 124 is engaged for a second 
exposure. This will deprotect second regions on the substrate 
and the process is repeated until the desired polymer 
sequences have been synthesized. 

The entire derivatized substrate is then exposed to a 
receptor of interest, preferably labeled with, for example, a 
fluorescent marker, by circulation of a solution or suspen- 
sion of the receptor through the cavity or by contacting the 
surface of the slide in bulk. The receptor will preferentially 
bind to certain regions of the substrate which contain 
complementary sequences. 

Antibodies are typically suspended in what is commonly 
referred to as "supercocktail," which may be, for example,
a solution of about 1% BSA (bovine serum albumin), 0.5%
Tween in PBS (phosphate buffered saline) buffer. The antibodies
are diluted into the supercocktail buffer to a final
concentration of, for example, about 0.1 to 4 µg/ml.

FIG. 8B illustrates an alternative preferred embodiment of
the reactor shown in FIG. 8A. According to this
embodiment, the mask 128 is placed directly in contact with
the substrate. Preferably, the etched portion of the mask is
placed face down so as to reduce the effects of light
dispersion. According to this embodiment, the imaging
lenses 120 and 126 are not necessary because the mask is
brought into close proximity with the substrate.

For purposes of increasing the signal-to-noise ratio of the 
technique, some embodiments of the invention provide for 
exposure of the substrate to a first labeled or unlabeled 
receptor followed by exposure of a labeled, second receptor 
(e.g., an antibody) which binds at multiple sites on the first 
receptor. If, for example, the first receptor is an antibody 
derived from a first species of an animal, the second receptor 
is an antibody derived from a second species directed to 
epitopes associated with the first species. In the case of a 
mouse antibody, for example, fluorescently labeled goat 
antibody or antiserum which is antimouse may be used to 
bind at multiple sites on the mouse antibody, providing 
several times the fluorescence compared to the attachment of 
a single mouse antibody at each binding site. This process 
may be repeated again with additional antibodies (e.g., 
goat-mouse-goat, etc.) for further signal amplification. 

In preferred embodiments an ordered sequence of masks 
is utilized. In some embodiments it is possible to use as few 
as a single mask to synthesize all of the possible polymers 
of a given monomer set. 

If, for example, it is desired to synthesize all 16 dinucle- 
otides from four bases, a 1 cm square synthesis region is 
divided conceptually into 16 boxes, each 0.25 cm wide. 
Denote the four monomer units by A, B, C, and D. The first 
reactions are carried out in four vertical columns, each 0.25 
cm wide. The first mask exposes the left-most column of 
boxes, where A is coupled. The second mask exposes the 
next column, where B is coupled; followed by a third mask, 
for the C column; and a final mask that exposes the right- 
most column, for D. The first, second, third, and fourth 
masks may be a single mask translated to different locations. 

The process is repeated in the horizontal direction for the 
second unit of the dimer. This time, the masks allow 
exposure of horizontal rows, again 0.25 cm wide. A, B, C, 
and D are sequentially coupled using masks that expose 
horizontal fourths of the reaction area. The resulting sub- 
strate contains all 16 dinucleotides of four bases. 
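
The column-then-row scheme above can be sketched in Python. This is only an illustration: the grid representation and the function name `synthesize_dimers` are assumptions introduced here, not part of the disclosure, and the chemistry is reduced to appending a letter to each exposed box.

```python
# Sketch of the column-then-row masking scheme for the 16 dimers of A, B, C, D.
# The 4 x 4 grid stands in for the 1 cm square divided into 0.25 cm boxes.
MONOMERS = ["A", "B", "C", "D"]
N = len(MONOMERS)


def synthesize_dimers():
    # grid[row][col] holds the chain grown so far in that box.
    grid = [["" for _ in range(N)] for _ in range(N)]
    exposures = 0
    # First unit: one mask exposure per vertical column.
    for col, monomer in enumerate(MONOMERS):
        for row in range(N):
            grid[row][col] += monomer
        exposures += 1
    # Second unit: one mask exposure per horizontal row.
    for row, monomer in enumerate(MONOMERS):
        for col in range(N):
            grid[row][col] += monomer
        exposures += 1
    return grid, exposures


grid, exposures = synthesize_dimers()
for row in grid:
    print(" ".join(row))
print("exposures:", exposures)
```

Each box receives exactly one monomer per pass, so eight exposures yield sixteen distinct dimers.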

The eight masks used to synthesize the dinucleotides are
related to one another by translation or rotation. In fact, one
mask can be used in all eight steps if it is suitably rotated and
translated. For example, in the example above, a mask with
a single transparent region could be sequentially used to
expose each of the vertical columns, rotated 90°, and then
sequentially used to allow exposure of the horizontal rows.

Tables 4 and 5 provide a simple computer program in 
Quick Basic for planning a masking program and a sample 
output, respectively, for the synthesis of a polymer chain of 
three monomers ("residues") having three different mono- 
mers in the first level, four different monomers in the second 
level, and five different monomers in the third level in a 
striped pattern. The output of the program is the number of 
cells, the number of "stripes" (light regions) on each mask, 
and the amount of translation required for each exposure of 
the mask.

TABLE 4

Mask Strategy Program

DEFINT A-Z
DIM b(20), w(20), l(500)
F$ = "LPT1:"
OPEN F$ FOR OUTPUT AS #1

jmax = 3 'Number of residues
b(1) = 3: b(2) = 4: b(3) = 5 'Number of building blocks for res 1, 2, 3
g = 1: lmax(1) = 1
FOR j = 1 TO jmax: g = g * b(j): NEXT j
w(0) = 0: w(1) = g / b(1)

PRINT #1, "MASK2.BAS", DATE$, TIME$: PRINT #1,
PRINT #1, USING "Number of residues=##"; jmax
FOR j = 1 TO jmax
  PRINT #1, USING " Residue ## ## building blocks"; j; b(j)
NEXT j
PRINT #1, " "
PRINT #1, USING "Number of cells=####"; g: PRINT #1,

FOR j = 2 TO jmax
  lmax(j) = lmax(j - 1) * b(j - 1)
  w(j) = w(j - 1) / b(j)
NEXT j

FOR j = 1 TO jmax
  PRINT #1, USING "Mask for residue ##"; j: PRINT #1,
  PRINT #1, USING " Number of stripes=###"; lmax(j)
  PRINT #1, USING " Width of each stripe=###"; w(j)
  FOR l = 1 TO lmax(j)
    a = 1 + (l - 1) * w(j - 1)
    ae = a + w(j) - 1
    PRINT #1, USING " Stripe ## begins at location ### and ends at ###"; l; a; ae
  NEXT l
  PRINT #1,
  PRINT #1, USING " For each of ## building blocks, translate mask by ## cell(s)"; b(j); w(j),
  PRINT #1, : PRINT #1, : PRINT #1,
NEXT j

© Copyright 1990, Affymax N.V.

TABLE 5



Masking Strategy Output

Number of residues= 3
  Residue 1    3 building blocks
  Residue 2    4 building blocks
  Residue 3    5 building blocks

Number of cells= 60

Mask for residue 1
  Number of stripes= 1
  Width of each stripe= 20
  Stripe 1 begins at location 1 and ends at 20
  For each of 3 building blocks, translate mask by 20 cell(s)

Mask for residue 2
  Number of stripes= 3
  Width of each stripe= 5
  Stripe 1 begins at location 1 and ends at 5
  Stripe 2 begins at location 21 and ends at 25
  Stripe 3 begins at location 41 and ends at 45
  For each of 4 building blocks, translate mask by 5 cell(s)

Mask for residue 3
  Number of stripes= 12
  Width of each stripe= 1
  Stripe 1 begins at location 1 and ends at 1
  Stripe 2 begins at location 6 and ends at 6
  Stripe 3 begins at location 11 and ends at 11
  Stripe 4 begins at location 16 and ends at 16
  Stripe 5 begins at location 21 and ends at 21
  Stripe 6 begins at location 26 and ends at 26
  Stripe 7 begins at location 31 and ends at 31
  Stripe 8 begins at location 36 and ends at 36
  Stripe 9 begins at location 41 and ends at 41
  Stripe 10 begins at location 46 and ends at 46
  Stripe 11 begins at location 51 and ends at 51
  Stripe 12 begins at location 56 and ends at 56
  For each of 5 building blocks, translate mask by 1 cell(s)

© Copyright 1990, Affymax N.V.
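
For readers without a Quick Basic interpreter, the same stripe arithmetic can be restated in Python. This is a sketch rather than a transcription of the listing: the function name `mask_strategy` and the dictionary keys are assumptions made here, and printing is replaced by returned values.

```python
def mask_strategy(blocks):
    """Plan a striped mask for each residue level.

    blocks[j] is the number of building blocks at level j (3, 4, 5 in Table 4).
    Returns the total cell count and one dict per level.
    """
    cells = 1
    for b in blocks:
        cells *= b

    plan = []
    stripes = 1      # the first mask has a single stripe
    period = 0       # spacing between stripe starts (w(j - 1) in the listing)
    width = cells
    for b in blocks:
        width //= b  # stripe width at this level, also the translation step
        starts = [1 + k * period for k in range(stripes)]
        plan.append({
            "blocks": b,
            "stripes": stripes,
            "width": width,
            "starts": starts,
            "ends": [s + width - 1 for s in starts],
            "translate": width,
        })
        stripes *= b
        period = width
    return cells, plan


cells, plan = mask_strategy([3, 4, 5])
print("Number of cells =", cells)
for level, mask in enumerate(plan, start=1):
    print("Mask for residue", level, mask["stripes"], "stripes of width", mask["width"])
```

With the inputs of Table 4 the sketch reproduces the stripe counts, widths, and start locations listed in Table 5.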



V. Details of One Embodiment of A Fluorescent 
Detection Device 

FIG. 9 illustrates a fluorescent detection device for detecting
fluorescently labeled receptors on a substrate. A substrate
112 is placed on an x/y translation table 202. In a
preferred embodiment the x/y translation table is a model no.
PM500-A1 manufactured by Newport Corporation. The x/y
translation table is connected to and controlled by an appropriately
programmed digital computer 204 which may be,
for example, an appropriately programmed IBM PC/AT or
AT compatible computer. Of course, other computer
systems, special purpose hardware, or the like could readily
be substituted for the AT computer used herein for illustration.
Computer software for the translation and data collection
functions described herein can be provided based on
commercially available software including, for example,
"Lab Windows" licensed by National Instruments, which is
incorporated herein by reference for all purposes.

The substrate and x/y translation table are placed under a 
microscope 206 which includes one or more objectives 208. 
Light (about 488 nm) from a laser 210, which in some 
embodiments is a model no. 2020-05 argon ion laser manu- 
factured by Spectraphysics, is directed at the substrate by a 
dichroic mirror 207 which passes greater than about 520 nm 
light but reflects 488 nm light. Dichroic mirror 207 may be, 
for example, a model no. FT510 manufactured by Carl 
Zeiss. Light reflected from the mirror then enters the micro- 
scope 206 which may be, for example, a model no. Axioscop 
20 manufactured by Carl Zeiss. Fluorescein-marked materials
on the substrate will fluoresce at wavelengths above 488 nm, and the
fluoresced light will be collected by the microscope and 
passed through the mirror. The fluorescent light from the 
substrate is then directed through a wavelength filter 209 
and, thereafter through an aperture plate 211. Wavelength 
filter 209 may be, for example, a model no. OG530 manu- 
factured by Melles Griot and aperture plate 211 may be, for 
example, a model no. 477352/477380 manufactured by Carl 
Zeiss. 

The fluoresced light then enters a photomultiplier tube
212, which in some embodiments is a model no. R943-02
manufactured by Hamamatsu; the signal is amplified in
preamplifier 214 and photons are counted by photon counter
216. The number of photons is recorded as a function of the
location in the computer 204. Preamplifier 214 may be, for
example, a model no. SR440 manufactured by Stanford
Research Systems and photon counter 216 may be a model
no. SR400 manufactured by Stanford Research Systems.
The substrate is then moved to a subsequent location and the
process is repeated. In preferred embodiments the data are
acquired every 1 to 100 µm, with a data collection diameter
of about 0.8 to 10 µm preferred. In embodiments with
sufficiently high fluorescence, a CCD detector with broad-field
illumination is utilized.

By counting the number of photons generated in a given 
area in response to the laser, it is possible to determine where
fluorescently marked molecules are located on the substrate.
Consequently, for a slide which has a matrix of polypeptides, 
for example, synthesized on the surface thereof, it is possible 
to determine which of the polypeptides is complementary to 
a fluorescently marked receptor. 
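
A minimal Python sketch of this step, under stated assumptions: the scan is held as a list of rows of photon counts, and a site is called "bright" when its count exceeds a multiple of the background. The function name `bright_sites`, the threshold factor of 3, and the sample counts are all hypothetical; the source does not specify a decision rule.

```python
def bright_sites(counts, background, factor=3.0):
    """Return (x, y) scan positions whose photon count exceeds factor * background.

    counts is a list of rows, one photon count per stage position.
    The factor of 3 is an arbitrary illustrative threshold.
    """
    threshold = factor * background
    return [
        (x, y)
        for y, row in enumerate(counts)
        for x, count in enumerate(row)
        if count > threshold
    ]


# Made-up counts for a 4 x 3 raster scan, for illustration only.
scan = [
    [12, 10, 11, 13],
    [11, 95, 102, 12],
    [10, 98, 99, 11],
]
print(bright_sites(scan, background=12))
```

Mapping the returned positions back to the known synthesis layout identifies which polypeptides bound the labeled receptor.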

According to preferred embodiments, the intensity and
duration of the light applied to the substrate are controlled by
varying the laser power and scan stage rate for improved 
signal-to-noise ratio by maximizing fluorescence emission 
and minimizing background noise. 

While the detection apparatus has been illustrated prima- 
rily herein with regard to the detection of marked receptors, 
the invention will find application in other areas. For 
example, the detection apparatus disclosed herein could be 
used in the fields of catalysis, DNA or protein gel scanning,
and the like. 

VI. Determination of Relative Binding Strength of
Receptors

The signal-to-noise ratio of the present invention is suf- 
ficiently high that not only can the presence or absence of a 
receptor on a ligand be detected, but also the relative binding 
affinity of receptors to a variety of sequences can be deter- 
mined. 

In practice it is found that a receptor will bind to several 
peptide sequences in an array, but will bind much more 
strongly to some sequences than others. Strong binding 
affinity will be evidenced herein by a strong fluorescent or 
radiographic signal since many receptor molecules will bind 
in a region of a strongly bound ligand. Conversely, a weak
binding affinity will be evidenced by a weak fluorescent or
radiographic signal due to the relatively small number of
receptor molecules which bind in a particular region of a
substrate having a ligand with a weak binding affinity for the
receptor. Consequently, it becomes possible to determine
relative binding avidity (or affinity in the case of univalent
interactions) of a ligand herein by way of the intensity of a
fluorescent or radiographic signal in a region containing that
ligand.
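
One way to turn region intensities into a relative ranking is sketched below in Python. It assumes that background-subtracted signal scales with the amount of bound receptor, which the text implies but does not quantify. The function name `relative_binding`, the ligand labels, and the counts are hypothetical placeholders, not measurements.

```python
def relative_binding(signals, background):
    """Scale background-subtracted signals so the strongest region is 1.0.

    signals maps a ligand label to the photon count measured in its region.
    """
    net = {ligand: max(count - background, 0) for ligand, count in signals.items()}
    top = max(net.values())
    if top == 0:
        return {ligand: 0.0 for ligand in net}
    return {ligand: value / top for ligand, value in net.items()}


# Placeholder counts, chosen only to show the arithmetic.
example = {"ligand_1": 2050, "ligand_2": 250, "ligand_3": 55}
ranking = relative_binding(example, background=50)
for ligand, score in sorted(ranking.items(), key=lambda item: -item[1]):
    print(ligand, score)
```

The scores are relative only; converting them to affinities would require calibration against known ligand-receptor pairs.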

Semiquantitative data on affinities might also be obtained
by varying washing conditions and concentrations of the
receptor. This would be done by comparison to known
ligand-receptor pairs, for example.

VII. Examples 

The following examples are provided to illustrate the
efficacy of the inventions herein. All operations were con- 
ducted at about ambient temperatures and pressures unless 
indicated to the contrary. 
A. Slide Preparation 
Before attachment of reactive groups it is preferred to
clean the substrate which is, in a preferred embodiment, a
glass substrate such as a microscope slide or cover slip.
According to one embodiment the slide is soaked in an
alkaline bath consisting of, for example, 1 liter of 95%
ethanol with 120 ml of water and 120 grams of sodium
hydroxide for 12 hours. The slides are then washed under
running water and allowed to air dry, and rinsed once with
a solution of 95% ethanol.

The slides are then aminated with, for example, amino- 
propyltriethoxysilane for the purpose of attaching amino 
groups to the glass surface on linker molecules, although any 
omega functionalized silane could also be used for this 
purpose. In one embodiment 0.1% aminopropyltriethoxysilane
is utilized, although solutions with concentrations from
10⁻⁷% to 10% may be used, with about 10⁻³% to 2%
preferred. A 0.1% mixture is prepared by adding to 100 ml
of a 95% ethanol/5% water mixture, 100 microliters (µl) of
aminopropyltriethoxysilane. The mixture is agitated at about
ambient temperature on a rotary shaker for about 5 minutes.
500 µl of this mixture is then applied to the surface of one
side of each cleaned slide. After 4 minutes, the slides are 
decanted of this solution and rinsed three times by dipping 
in, for example, 100% ethanol. 
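
The dilution arithmetic can be checked with a short Python sketch. It assumes the percentages are volume/volume relative to the solvent volume, which the source does not state explicitly; the function name `silane_volume_ul` is introduced here for illustration.

```python
def silane_volume_ul(percent, solvent_ml):
    """Microliters of silane for a given volume percent of the solvent volume.

    Assumes a volume/volume percentage; the small change in total volume
    caused by adding the silane is ignored.
    """
    return percent / 100.0 * solvent_ml * 1000.0


# 0.1% in 100 ml of the ethanol/water mixture: about 100 microliters.
print(round(silane_volume_ul(0.1, 100), 3))
# The preferred range of about 0.001% to 2% in the same 100 ml.
print(round(silane_volume_ul(0.001, 100), 3), round(silane_volume_ul(2, 100), 3))
```

On this assumption the preferred range corresponds to roughly 1 µl to 2 ml of silane per 100 ml of solvent.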

After the plates dry, they are placed in a 110-120° C. 
vacuum oven for about 20 minutes, and then allowed to cure 
at room temperature for about 12 hours in an argon envi- 
ronment. The slides are then dipped into DMF 
(dimethylformamide) solution, followed by a thorough 
washing with methylene chloride. 

The aminated surface of the slide is then exposed to about
500 µl of, for example, a 30 millimolar (mM) solution of
NVOC-GABA (gamma amino butyric acid) NHS
(N-hydroxysuccinimide) in DMF for attachment of an
NVOC-GABA to each of the amino groups.

The surface is washed with, for example, DMF, methyl- 
ene chloride, and ethanol. 

Any unreacted aminopropyl silane on the surface — that is, 
those amino groups which have not had the NVOC-GABA 
attached — are now capped with acetyl groups (to prevent 
further reaction) by exposure to a 1:3 mixture of acetic 
anhydride in pyridine for 1 hour. Other materials which may 
perform this residual capping function include trifluoroacetic
anhydride, formic acetic anhydride, or other reactive
acylating agents. Finally, the slides are washed again with 
DMF, methylene chloride, and ethanol. 
B. Synthesis of Eight Trimers of "A" and "B" 

FIG. 10 illustrates a possible synthesis of the eight trimers 
of the two-monomer set: gly, phe (represented by "A" and 
"B," respectively). A glass slide bearing silane groups ter- 
minating in 6-nitroveratryloxycarboxamide (NVOC-NH) 
residues is prepared as a substrate. Active esters 
(pentafluorophenyl, OBt, etc.) of gly and phe protected at the 
amino group with NVOC are prepared as reagents. While
not pertinent to this example, if side chain protecting groups
are required for the monomer set, these must not be photoreactive
at the wavelength of light used to deprotect the
primary chain.

For a monomer set of size n, n×l cycles are required to
synthesize all possible sequences of length l. A cycle consists
of:

1. Irradiation through an appropriate mask to expose the 
amino groups at the sites where the next residue is to be 
added, with appropriate washes to remove the 
by-products of the deprotection. 

2. Addition of a single activated and protected (with the 
same photochemically-removable group) monomer, 
which will react only at the sites addressed in step 1, 
with appropriate washes to remove the excess reagent 
from the surface. 

In one embodiment, the above cycle is repeated for each
member of the monomer set until each location on the
surface has been extended by one residue. In other
embodiments, several residues are sequentially added at one
location before moving on to the next location. Cycle times
will generally be limited by the coupling reaction rate, now
as short as 20 min in automated peptide synthesizers. This
step is optionally followed by addition of a protecting group
to stabilize the array for later testing. For some types of
polymers (e.g., peptides), a final deprotection of the entire
surface (removal of photoprotective side chain groups) may
be required.

More particularly, as shown in FIG. 10A, the glass 20 is
provided with regions 22, 24, 26, 28, 30, 32, 34, and 36.
Regions 30, 32, 34, and 36 are masked, as shown in FIG.
10B, and the glass is irradiated and exposed to a reagent
containing "A" (e.g., gly), with the resulting structure shown
in FIG. 10C. Thereafter, regions 22, 24, 26, and 28 are
masked, the glass is irradiated (as shown in FIG. 10D) and
exposed to a reagent containing "B" (e.g., phe), with the
resulting structure shown in FIG. 10E. The process
proceeds, consecutively masking and exposing the sections
as shown until the structure shown in FIG. 10M is obtained.
The glass is irradiated and the terminal groups are,
optionally, capped by acetylation. As shown, all possible
trimers of gly/phe are obtained.
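
The cycle structure can be simulated in a few lines of Python. The sketch assumes eight regions in a row and a binary-subdivision mask order; only the first two maskings (regions 22-28 for "A", regions 30-36 for "B") are taken from the text, and the later mask layouts are an assumption, since FIG. 10 is not reproduced here. The function name `synthesize_all` is hypothetical.

```python
def synthesize_all(monomers, length):
    """Grow every sequence of the given length, one masked cycle per monomer per level."""
    n = len(monomers)
    sites = n ** length
    chains = [""] * sites
    cycles = 0
    for level in range(length):
        # Number of adjacent sites that receive the same monomer at this level.
        block = n ** (length - level - 1)
        for index, monomer in enumerate(monomers):
            # "Mask": only sites whose block index matches are irradiated,
            # deprotected, and therefore coupled in this cycle.
            for site in range(sites):
                if (site // block) % n == index:
                    chains[site] += monomer
            cycles += 1
    return chains, cycles


chains, cycles = synthesize_all(["A", "B"], 3)
print(chains)
print("cycles:", cycles)
```

For the two-monomer set and a length of three, six cycles give the eight trimers, consistent with the n×l count stated above.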

In this example, no side chain protective group removal is
necessary. If it is desired, side chain deprotection may be
accomplished by treatment with ethanedithiol and trifluoroacetic
acid.

In general, the number of steps needed to obtain a
particular polymer chain is defined by:

    $n \times l$    (1)

where:

n = the number of monomers in the basis set of monomers,
and

l = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of
length l will be:

    $n^{l}$    (2)

Of course, greater diversity is obtained by using masking
strategies which will also include the synthesis of polymers
having a length of less than l. If, in the extreme case, all
polymers having a length less than or equal to l are
synthesized, the number of polymers synthesized will be:

    $n^{l} + n^{l-1} + \cdots + n$    (3)



The maximum number of lithographic steps needed will
generally be n for each "layer" of monomers, i.e., the total
number of masks (and, therefore, the number of lithographic
steps) needed will be n×l. The size of the transparent mask
regions will vary in accordance with the area of the substrate
available for synthesis and the number of sequences to be
formed. In general, the size of the synthesis areas will be:

    size of synthesis areas = $A / S$

where:

A is the total area available for synthesis; and
S is the number of sequences desired in the area.
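
These relationships are easy to tabulate. The Python sketch below simply evaluates the step count, the number of sequences, and the area per sequence; the function names are introduced here for illustration, and the twenty-monomer example is an assumed input, not a figure from the text.

```python
def lithographic_steps(n, l):
    """Steps (and masks) needed: n per layer, l layers."""
    return n * l


def sequences_of_length(n, l):
    """Number of distinct sequences of exactly l units."""
    return n ** l


def sequences_up_to_length(n, l):
    """Number of sequences of every length from 1 through l."""
    return sum(n ** k for k in range(1, l + 1))


def area_per_sequence(total_area, sequences):
    """Synthesis area available to each sequence (same units as total_area)."""
    return total_area / sequences


# The gly/phe trimer example: n = 2, l = 3.
print(lithographic_steps(2, 3), sequences_of_length(2, 3), sequences_up_to_length(2, 3))
# An assumed case of pentamers from a 20-monomer set.
print(lithographic_steps(20, 5), sequences_of_length(20, 5))
# The 16 dimers on a 1 cm square: area per sequence in square centimeters.
print(area_per_sequence(1.0, 16))
```

The step count grows linearly in l while the number of sequences grows exponentially, which is why a modest number of masks can address a very large array.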

It will be appreciated by those of skill in the art that the
above method could readily be used to simultaneously
produce thousands or millions of oligomers on a substrate
using the photolithographic techniques disclosed herein.
Consequently, the method results in the ability to practically
test large numbers of, for example, di, tri, tetra, penta, hexa,
hepta, octapeptides, dodecapeptides, or larger polypeptides
(or correspondingly, polynucleotides).

The above example has illustrated the method by way of
a manual example. It will of course be appreciated that
automated or semi-automated methods could be used. The
substrate would be mounted in a flow cell for automated
addition and removal of reagents, to minimize the volume of
reagents needed, and to more carefully control reaction
conditions. Successive masks could be applied manually or
automatically.

C. Synthesis of a Dimer of an Aminopropyl Group and a
Fluorescent Group

In synthesizing the dimer of an aminopropyl group and a
fluorescent group, a functionalized durapore membrane was
used as a substrate. The durapore membrane was a polyvinylidene
difluoride with aminopropyl groups. The aminopropyl
groups were protected with the DDZ group by
reaction of the carbonyl chloride with the amino groups, a
reaction readily known to those of skill in the art. The
surface bearing these groups was placed in a solution of THF
and contacted with a mask bearing a checkerboard pattern of
1 mm opaque and transparent regions. The mask was
exposed to ultraviolet light having a wavelength down to at
least about 280 nm for about 5 minutes at ambient
temperature, although a wide range of exposure times and
temperatures may be appropriate in various embodiments of
the invention. For example, in one embodiment, an exposure
time of between about 1 and 5000 seconds may be used at
process temperatures of between -70 and +50° C.

In one preferred embodiment, exposure times of between 
about 1 and 500 seconds at about ambient pressure are used. 
In some preferred embodiments, pressure above ambient is 
used to prevent evaporation. 

The surface of the membrane was then washed for about
1 hour with a fluorescent label which included an active ester
bound to a chelate of a lanthanide. Wash times will vary over
a wide range of values from about a few minutes to a few
hours. These materials fluoresce in the red and the green
visible region. After the reaction with the active ester in the
fluorophore was complete, the locations in which the fluorophore
was bound could be visualized by exposing them to
ultraviolet light and observing the red and the green fluorescence.
It was observed that the derivatized regions of the
substrate closely corresponded to the original pattern of the
mask.

D. Demonstration of Signal Capability 

Signal detection capability was demonstrated using a
low-level standard fluorescent bead kit manufactured by
Flow Cytometry Standards and having model no. 824. This
kit includes 5.8 µm diameter beads, each impregnated with
a known number of fluorescein molecules.

One of the beads was placed in the illumination field on
the scan stage as shown in FIG. 9 in a field of a laser spot
which was initially shuttered. After the bead was positioned
in the illumination field, the photon detection equipment was
turned on. The laser beam was unblocked and it interacted
with the particle bead, which then fluoresced. Fluorescence
curves of beads impregnated with 7,000; 13,000; and 29,000
fluorescein molecules are shown in FIGS. 11A, 11B, and
11C, respectively. On each curve, traces for beads without
fluorescein molecules are also shown. These experiments
were performed with 488 nm excitation, with 100 µW of
laser power. The light was focused through a 40 power 0.75
NA objective.

The fluorescence intensity in all cases started off at a high
value and then decreased exponentially. The fall-off in
intensity is due to photobleaching of the fluorescein molecules.
The traces of beads without fluorescein molecules
are used for background subtraction. The difference in the
initial exponential decay between labeled and nonlabeled
beads is integrated to give the total number of photon counts,
and this number is related to the number of molecules per
bead. Therefore, it is possible to deduce the number of
photons per fluorescein molecule that can be detected. For
the curves illustrated in FIG. 11, this calculation indicates
that about 40 to 50 photons per fluorescein
molecule are detected.
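
The calculation described above amounts to summing the background-subtracted counts over the decay and dividing by the known number of molecules on the bead. A minimal Python sketch follows; the function name `photons_per_molecule` is an assumption, and the traces are synthetic exponentials invented for illustration, not data read from FIG. 11.

```python
import math


def photons_per_molecule(labeled, blank, molecules):
    """Integrate (labeled - blank) counts bin by bin and divide by the molecule count."""
    if len(labeled) != len(blank):
        raise ValueError("traces must cover the same time bins")
    net_counts = sum(a - b for a, b in zip(labeled, blank))
    return net_counts / molecules


# Synthetic traces: a flat background plus an exponentially bleaching signal.
# The amplitude and decay constant are made up for the demonstration.
blank = [50.0] * 200
labeled = [50.0 + 4000.0 * math.exp(-t / 25.0) for t in range(200)]
print(round(photons_per_molecule(labeled, blank, molecules=7000), 1))
```

Repeating the same division for beads with different known loadings gives a consistency check on the per-molecule figure.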

E. Determination of the Number of Molecules Per Unit Area 
Aminopropylated glass microscope slides prepared
according to the methods discussed above were utilized in
order to establish the density of labeling of the slides. The 
free amino termini of the slides were reacted with FITC 
(fluorescein isothiocyanate) which forms a covalent linkage 
with the amino group. The slide is then scanned to count the 
number of fluorescent photons generated in a region which, 
using the estimated 40-50 photons per fluorescent molecule, 
enables the calculation of the number of molecules which 
are on the surface per unit area. 

A slide with aminopropyl silane on its surface was 
immersed in a 1 mM solution of FITC in DMF for 1 hour at 
about ambient temperature. After reaction, the slide was 
washed twice with DMF and then washed with ethanol, 
water, and then ethanol again. It was then dried and stored 
in the dark until it was ready to be examined. 

Through the use of curves similar to those shown in FIG.
11, and by integrating the fluorescent counts under the
exponentially decaying signal, the number of free amino
groups on the surface after derivatization was determined. It
was determined that slides with labeling densities of 1
fluorescein per 10³×10³ to ~2×2 nm could be reproducibly
made as the concentration of aminopropyltriethoxysilane
varied from 10⁻⁵% to 10⁻¹%.
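
To put these densities on a common scale, the square area per fluorescein can be converted to molecules per square micrometer. The Python sketch below does only that unit conversion; the function name `molecules_per_um2` is introduced here.

```python
NM2_PER_UM2 = 1_000_000  # 1 µm = 1000 nm, so 1 µm² = 10^6 nm²


def molecules_per_um2(side_nm):
    """Molecules per square micrometer when each occupies a side_nm x side_nm square."""
    return NM2_PER_UM2 / (side_nm * side_nm)


print(molecules_per_um2(1000))  # one fluorescein per 10³ x 10³ nm
print(molecules_per_um2(2))     # one fluorescein per 2 x 2 nm
```

The two ends of the reported range therefore differ by a factor of 250,000 in surface density.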

F. Removal of NVOC and Attachment of a Fluorescent 
Marker 

NVOC-GABA groups were attached as described above. 
The entire surface of one slide was exposed to light so as to 
expose a free amino group at the end of the gamma amino 
butyric acid. This slide, and a duplicate which was not 
exposed, were then exposed to fluorescein isothiocyanate 
(FITC). 

FIG. 12A illustrates the slide which was not exposed to 
light, but which was exposed to FITC. The units of the x axis 
are time and the units of the y axis are counts. The trace 
contains a certain amount of background fluorescence. The 
duplicate slide was exposed to 350 nm broadband illumination
for about 1 minute (12 mW/cm², ~350 nm
illumination), washed and reacted with FITC. The fluorescence
curves for this slide are shown in FIG. 12B. A large
increase in the level of fluorescence is observed, which 
indicates photolysis has exposed a number of amino groups 
on the surface of the slides for attachment of a fluorescent 
marker. 

G. Use of a Mask in Removal of NVOC 

The next experiment was performed with a 0.1% aminopropylated
slide. Light from a Hg-Xe arc lamp was imaged
onto the substrate through a laser-ablated chrome-on-glass
mask in direct contact with the substrate.

This slide was illuminated for approximately 5 minutes,
with 12 mW of 350 nm broadband light and then reacted
with the 1 mM FITC solution. It was put on the laser
detection scanning stage and a graph was plotted as a
two-dimensional representation of position versus fluorescence
intensity. The fluorescence intensity (in counts) as a
function of location is given on the scale to the right of FIG.
13A for a mask having 100×100 µm squares.

The experiment was repeated a number of times through
various masks. The fluorescence pattern for a 50 µm mask is
illustrated in FIG. 13B, for a 20 µm mask in FIG. 13C, and
for a 10 µm mask in FIG. 13D. The mask pattern is distinct
down to at least about 10 µm squares using this lithographic
technique.

H. Attachment of YGGFL and Subsequent Exposure to Herz 
Antibody and Goat Antimouse 

In order to establish that receptors to a particular polypeptide
sequence would bind to a surface-bound peptide and be
detected, Leu enkephalin was coupled to the surface and
recognized by an antibody. A slide was derivatized with
0.1% aminopropyltriethoxysilane and protected with
NVOC. A 500 µm checkerboard mask was used to expose
the slide in a flow cell using backside contact printing. The
Leu enkephalin sequence (H₂N-tyrosine, glycine, glycine,
phenylalanine, leucine-CO₂H, otherwise referred to herein as
YGGFL) was attached via its carboxy end to the exposed
amino groups on the surface of the slide. The peptide was
added in DMF solution with the BOP/HOBT/DIEA coupling
reagents and recirculated through the flow cell for 2
hours at room temperature.

A first antibody, known as the Herz antibody, was applied
to the surface of the slide for 45 minutes at 2 µg/ml in a
supercocktail (containing 1% BSA and 1% ovalbumin also
in this case). A second antibody, goat anti-mouse fluorescein
conjugate, was then added at 2 µg/ml in the supercocktail
buffer, and allowed to incubate for 2 hours.

The results of this experiment are provided in FIG. 14.
Again, this figure illustrates fluorescence intensity as a
function of position. The fluorescence scale is shown on the
right. This image was taken at 10 µm steps. This figure
indicates that not only can deprotection be carried out in a
well defined pattern, but also that (1) the method provides
for successful coupling of peptides to the surface of the
substrate, (2) the surface of a bound peptide is available for
binding with an antibody, and (3) the detection apparatus
capabilities are sufficient to detect binding of a receptor.

I. Monomer-by-Monomer Formation of YGGFL and Subsequent
Exposure to Labeled Antibody

Monomer-by-monomer synthesis of YGGFL and GGFL
in alternate squares was performed on a slide in a checkerboard
pattern and the resulting slide was exposed to the Herz
antibody. This experiment and the results thereof are illustrated
in FIGS. 15A, 15B, 15C, and 15D.

In FIG. 15A, a slide is shown which is derivatized with
the aminopropyl group, protected in this case with t-BOC
(t-butoxycarbonyl). The slide was treated with TFA to
remove the t-BOC protecting group. ε-Aminocaproic acid,
which was t-BOC protected at its amino group, was then
coupled onto the aminopropyl groups. The aminocaproic
acid serves as a spacer between the aminopropyl group and
the peptide to be synthesized. The amino end of the spacer
was deprotected and coupled to NVOC-leucine. The entire
slide was then illuminated with 12 mW of 325 nm broadband
illumination. The slide was then coupled with NVOC-phenylalanine
and washed. The entire slide was again
illuminated, then coupled to NVOC-glycine and washed.
The slide was again illuminated and coupled to NVOC-glycine
to form the sequence shown in the last portion of
FIG. 15A.

As shown in FIG. 15B, alternating regions of the slide
were then illuminated using a projection print using a
500×500 μm checkerboard mask; thus, the amino group of
glycine was exposed only in the lighted areas. When the next
coupling chemistry step was carried out, NVOC-tyrosine
was added, and it coupled only at those spots which had
received illumination. The entire slide was then illuminated
to remove all the NVOC groups, leaving a checkerboard of
YGGFL in the lighted areas and GGFL in the other areas.
The Herz antibody (which recognizes YGGFL, but not
GGFL) was then added, followed by goat anti-mouse
fluorescein conjugate.
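
The masking step just described can be illustrated with a short simulation. This is only a sketch under stated assumptions: the 4×4 grid size and the helper names `checkerboard_mask` and `couple` are hypothetical and do not appear in the specification. Every site starts as GGFL, and tyrosine (Y) is added at the amino end only where the mask admits light.

```python
def checkerboard_mask(rows, cols):
    """Return a grid of booleans; True marks a site exposed to light."""
    return [[(r + c) % 2 == 0 for c in range(cols)] for r in range(rows)]


def couple(grid, monomer, mask):
    """Add `monomer` at the amino end of the peptide at every lit site."""
    return [
        [monomer + seq if lit else seq for seq, lit in zip(row, mask_row)]
        for row, mask_row in zip(grid, mask)
    ]


rows, cols = 4, 4  # illustrative grid size (assumption)

# After the whole-slide coupling steps, every site carries GGFL.
slide = [["GGFL"] * cols for _ in range(rows)]

# Illuminate through the checkerboard mask, then couple tyrosine (Y).
slide = couple(slide, "Y", checkerboard_mask(rows, cols))

for row in slide:
    print(" ".join(f"{seq:5}" for seq in row))
```

Lit sites end up as YGGFL and unlit sites remain GGFL, which is the pattern the antibody binding experiment is designed to reveal.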

The resulting fluorescence scan is shown in FIG. 15C, and
the scale for the fluorescence intensity is again given on the
right. Dark areas contain the tetrapeptide GGFL, which is
not recognized by the Herz antibody (and thus there is no
binding of the goat anti-mouse antibody with fluorescein
conjugate), and in the red areas YGGFL is present. The
YGGFL pentapeptide is recognized by the Herz antibody
and, therefore, there is antibody in the lighted regions for the
fluorescein-conjugated goat anti-mouse to recognize.

Similar patterns are shown for a 50 μm mask used in
direct contact ("proximity print") with the substrate in FIG.
15D. Note that the pattern is more distinct and the corners
of the checkerboard pattern are touching when the mask is
placed in direct contact with the substrate (which reflects the
increase in resolution using this technique).
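
To put the two mask sizes in perspective, the arithmetic below converts the edge length of a square region into the number of such regions per square centimetre. It is a minimal sketch that assumes the squares tile the surface with no gaps; the function name `sites_per_cm2` is hypothetical.

```python
def sites_per_cm2(edge_um):
    """Square sites of the given edge length (micrometres) per cm^2."""
    sites_per_cm = 10_000 / edge_um  # 1 cm = 10,000 micrometres
    return sites_per_cm ** 2


for edge in (500, 50):
    print(f"{edge} um squares: {sites_per_cm2(edge):,.0f} sites per cm^2")
```

Under this assumption a 500 μm checkerboard corresponds to 400 regions per cm², and a 50 μm checkerboard to 40,000 regions per cm².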
J. Monomer-by-Monomer Synthesis of YGGFL and PGGFL

A synthesis using a 50 μm checkerboard mask similar to
that shown in FIG. 15 was conducted. However, P was added
to the GGFL sites on the substrate through an additional
coupling step. P was added by exposing protected GGFL to
light through a mask, and subsequent exposure to P in the
manner set forth above. Therefore, half of the regions on the
substrate contained YGGFL and the remaining half
contained PGGFL.

The fluorescence plot for this experiment is provided in 
FIG. 16. As shown, the regions are again readily discernable. 
This experiment demonstrates that antibodies are able to 
recognize a specific sequence and that the recognition is not 
length-dependent. 

K. Monomer-by-Monomer Synthesis of YGGFL and YPGGFL

In order to further demonstrate the operability of the
invention, a 50 μm checkerboard pattern of alternating
YGGFL and YPGGFL was synthesized on a substrate using
techniques like those set forth above. The resulting
fluorescence plot is provided in FIG. 17. Again, it is seen that the
antibody is clearly able to recognize the YGGFL sequence
and does not bind significantly at the YPGGFL regions.

L. Synthesis of an Array of Sixteen Different Amino Acid
Sequences and Estimation of Relative Binding Affinity to
Herz Antibody

Using techniques similar to those set forth above, an array 
of 16 different amino acid sequences (replicated four times) 
was synthesized on each of two glass substrates. The 
sequences were synthesized by attaching the sequence 
NVOC-GFL across the entire surface of the slides. Using a 
series of masks, two layers of amino acids were then 
selectively applied to the substrate. Each region had dimensions
of 0.25 cm × 0.0625 cm. The first slide contained amino
acid sequences containing only L amino acids while the 
second slide contained selected D amino acids. FIGS. 18A 
and 18B illustrate a map of the various regions on the first 
and second slides, respectively. The patterns shown in FIGS. 
18A and 18B were duplicated four times on each slide. The 
slides were then exposed to the Herz antibody and 
fluorescein-labeled goat anti-mouse. 

FIG. 19 is a fluorescence plot of the first slide, which 
contained only L amino acids. Red indicates strong binding 
(149,000 counts or more) while black indicates little or no 
binding of the Herz antibody (20,000 counts or less). The 
bottom right-hand portion of the slide appears "cut off" 
because the slide was broken during processing. The 
sequence YGGFL is clearly most strongly recognized. The 
sequences YAGFL and YSGFL also exhibit strong recogni- 
tion of the antibody. By contrast, most of the remaining 
sequences show little or no binding. The four duplicate 
portions of the slide are extremely consistent in the amount 
of binding shown therein. 

FIG. 20 is a fluorescence plot of the second slide. Again, 
strongest binding is exhibited by the YGGFL sequence. 
Significant binding is also detected to YaGFL, YsGFL, and 
YpGFL. The remaining sequences show less binding with 
the antibody. Note the low binding efficiency of the 
sequence yGGFL. 

Table 6 lists the various sequences tested in order of
relative fluorescence, which provides information regarding
relative binding affinity.

TABLE 6

Apparent Binding to Herz Ab

L-a.a. Set      D-a.a. Set

YGGFL           YGGFL
YAGFL           YaGFL
YSGFL           YsGFL
LGGFL           YpGFL
PGGFL           fGGFL
YPGFL           yGGFL
LAGFL           faGFL
FAGFL           WGGFL
WGGFL           yaGFL
                fpGFL
                WaGFL


VIII. Illustrative Alternative Embodiment 

According to an alternative embodiment of the invention, 
the methods provide for attaching to the surface a caged 
binding member which in its caged form has a relatively low 
affinity for other potentially binding species, such as recep- 
tors and specific binding substances. Such techniques are 
more fully described in copending application Ser. No. 
404,920, filed Sep. 8, 1989, and incorporated herein by 
reference for all purposes. 

According to this alternative embodiment, the invention 
provides methods for forming predefined regions on a 
surface of a solid support, wherein the predefined regions are 
capable of immobilizing receptors. The methods make use 
of caged binding members attached to the surface to enable 
selective activation of the predefined regions. The caged 
binding members are liberated to act as binding members 
ultimately capable of binding receptors upon selective acti- 
vation of the predefined regions. The activated binding 
members are then used to immobilize specific molecules 
such as receptors on the predefined region of the surface. 
The above procedure is repeated at the same or different sites 
on the surface so as to provide a surface prepared with a 
plurality of regions on the surface containing, for example, 
the same or different receptors. When receptors immobilized 
in this way have a differential affinity for one or more 
ligands, screenings and assays for the ligands can be con- 
ducted in the regions of the surface containing the receptors. 

The alternative embodiment may make use of novel caged 
binding members attached to the substrate. Caged 
(unactivated) members have a relatively low affinity for 
receptors of substances that specifically bind to uncaged 
binding members when compared with the corresponding 
affinities of activated binding members. Thus, the binding 
members are protected from reaction until a suitable source 
of energy is applied to the regions of the surface desired to 
be activated. Upon application of a suitable energy source, 
the caging groups labilize, thereby presenting the activated 
binding member. A typical energy source will be light. 

Once the binding members on the surface are activated 
they may be attached to a receptor. The receptor chosen may 
be a monoclonal antibody, a nucleic acid sequence, a drug 
receptor, etc. The receptor will usually, though not always, 
be prepared so as to permit attaching it, directly or indirectly, 
to a binding member. For example, a specific binding 
substance having a strong binding affinity for the binding 
member and a strong affinity for the receptor or a conjugate 
of the receptor may be used to act as a bridge between 
binding members and receptors if desired. The method uses 
a receptor prepared such that the receptor retains its activity 
toward a particular ligand. 

Preferably, the caged binding member attached to the
solid substrate will be a photoactivatable biotin complex,
i.e., a biotin molecule that has been chemically modified
with photoactivatable protecting groups so that it has a
significantly lower binding affinity for avidin or avidin
analogs than does natural biotin. In a preferred embodiment,
the protecting groups localized in a predefined region of the
surface will be removed upon application of a suitable
source of radiation to give binding members that are biotin
or a functionally analogous compound having substantially
the same binding affinity for avidin or avidin analogs as does
biotin.

In another preferred embodiment, avidin or an avidin 
analog is incubated with activated binding members on the 
surface until the avidin binds strongly to the binding mem- 
bers. The avidin so immobilized on predefined regions of the 
surface can then be incubated with a desired receptor or 
conjugate of a desired receptor. The receptor will preferably 
be biotinylated, e.g., a biotinylated antibody, when avidin is 
immobilized on the predefined regions of the surface. 
Alternatively, a preferred embodiment will present an 
avidin/biotinylated receptor complex, which has been pre- 
viously prepared, to activated binding members on the 
surface. 

IX. Conclusion 

The present inventions provide greatly improved methods 
and apparatus for synthesis of polymers on substrates. It is 
to be understood that the above description is intended to be 
illustrative and not restrictive. Many embodiments will be 
apparent to those of skill in the art upon reviewing the above 
description. By way of example, the invention has been 
described primarily with reference to the use of photore- 
movable protective groups, but it will be readily recognized 
by those of skill in the art that sources of radiation other than 
light could also be used. For example, in some embodiments
it may be desirable to use protective groups which are
sensitive to electron beam irradiation or x-ray irradiation, in
combination with electron beam lithography or x-ray
lithography techniques. Alternatively, the group could be removed
by exposure to an electric current. The scope of the invention
should, therefore, be determined not with reference to the 
above description, but should instead be determined with 
reference to the appended claims, along with the full scope 
of equivalents to which such claims are entitled. 
What is claimed is: 

1. An array of oligonucleotides, the array comprising:
a planar solid support having at least a first surface; and
a plurality of different oligonucleotides attached to the
first surface of the solid support at a density exceeding
400 different oligonucleotides/cm², wherein each of the
different oligonucleotides is attached to the surface of
the solid support in a different known location, and has
a different determinable sequence.

2. The array of claim 1, wherein each different
oligonucleotide is from about 4 to about 20 nucleotides in length.

3. The array of claim 1, wherein each different oligo- 
nucleotide is at least 12 nucleotides in length. 

4. The array of claim 1, wherein each different oligo- 
nucleotide is 2-100 nucleotides in length. 

5. The array of claim 1, wherein the array comprises at 
least 1,000 different oligonucleotides attached to the first 
surface of the solid support. 

6. The array of claim 1, wherein the array comprises at 
least 10,000 different oligonucleotides attached to the first 
surface of the solid support. 

7. The array of claim 1, wherein each of the different 
known locations is physically separated from each other of 
the known locations. 

8. The array of claim 1, wherein said planar solid support
is glass.

9. The array of claim 1, wherein said oligonucleotides are
attached to the first surface of the solid support through a
linker group.

10. The array of claim 1, wherein the oligonucleotides in
the different known locations are at least 20% pure.

11. The array of claim 1, wherein the oligonucleotides in
the different known locations are at least 50% pure.

12. The array of claim 1, wherein the oligonucleotides in
the different known locations are at least 80% pure.

13. The array of claim 1, wherein the oligonucleotides in
the different known locations are at least 90% pure.

14. The array of claim 1, wherein said array is produced 
by a binary synthesis process, said process comprising the 
steps of: 

providing a planar solid support, said solid support having 
a plurality of compounds immobilized on a surface 
thereof, said compounds having protecting groups 
coupled thereto; 

deprotecting a first portion of said plurality of compounds 
on said surface and not a second portion of said 
plurality of compounds; 

reacting said first portion of said plurality of compounds 
with a first component of said oligonucleotide; 

deprotecting at least a third portion of said plurality of
compounds on said surface, said third portion comprising
a fraction of said first portion of said plurality of
compounds;

reacting said at least third portion of said plurality of 
compounds with a second component of said oligo- 
nucleotide; and 

optionally repeating said binary synthesis steps to produce 
said oligonucleotide array. 

15. An array of nucleic acids, the array comprising: 

a planar support having at least a first surface; and 

a plurality of different nucleic acids attached to the first
surface of the solid support at a density exceeding 400
different nucleic acids/cm², wherein each of the different
nucleic acids is attached to the surface of the solid
support in a different known location, has a different
determinable sequence, wherein the different nucleic
acids in the different known locations are at least 10%
pure.

16. The array of claim 15, wherein each different nucleic
acid is at least 20 nucleotides in length.

17. The array of claim 15, wherein the array comprises at 
least 1,000 different nucleic acids attached to the first surface 
of the solid support. 

18. The array of claim 15, wherein the array comprises at 
least 10,000 different nucleic acids attached to the first 
surface of the solid support. 

19. The array of claim 15, wherein each of the different 
known locations is physically separated from each of the 
other known locations. 

20. The array of claim 15, wherein said planar solid 
support is glass. 

21. The array of claim 15, wherein said nucleic acids are 
attached to the first surface of the solid support through a 
linker group. 

22. The array of claim 15, wherein the nucleic acids in the
different known locations comprise nucleic acids that are at
least 20% pure.

23. The array of claim 15, wherein the nucleic acids in the
different known locations comprise nucleic acids that are at
least 50% pure.

24. The array of claim 15, wherein the nucleic acids in the
different known locations are at least 80% pure.

25. The array of claim 15, wherein the nucleic acids in the
different known locations are at least 90% pure.

26. The array of claim 15, wherein said array is produced 
by a binary synthesis process, said process comprising the 
steps of: 

providing a planar, solid support, said solid support
having a plurality of compounds immobilized on a surface
thereof, said compounds having protecting groups
coupled thereto;

deprotecting a first portion of said plurality of compounds
on said surface and not a second portion of said
plurality of compounds;

reacting said first portion of said plurality of compounds
with a first reactant;

deprotecting at least a third portion of said plurality of
compounds on said surface, said third portion comprising
a fraction of said first portion of said plurality of
compounds;

reacting said at least third portion of said plurality of 
compounds with a second reactant; and 

optionally repeating said binary synthesis steps to produce 
said nucleic acid array. 

27. The array of claim 15, wherein the nucleic acids are 
covalently attached to the support. 

28. An array of nucleic acids, the array comprising: 
a planar support having at least a first surface; and 

a plurality of different nucleic acids attached to the first
surface of the solid support at a density exceeding
10,000 different nucleic acids/cm², wherein each of the
different nucleic acids is attached to the surface of the
solid support in a different known location, and has a
different determinable sequence.

29. An array of nucleic acids, the array comprising:
a planar support having at least a first surface; and

a plurality of different nucleic acids attached to the first
surface of the solid support at a density exceeding 400
different nucleic acids/cm², wherein each of the different
nucleic acids is attached to the surface of the solid
support in a different known location, has a different
determinable sequence, wherein the surface and the
support are made from different materials.

30. The array of claim 15, wherein the different known 
locations are square in shape. 

31. The array of claim 15, wherein the substrate is glass. 

32. The array of claim 15, wherein the substrate is silicon 
dioxide. 

33. The array of claim 15, wherein the substrate is 
(poly)tetrafluoroethylene, (poly)vinylidenedifluoride, poly- 
styrene or polycarbonate. 

34. The method of claim 15, wherein the substrate is 
optically transparent. 

35. The array of claim 15, wherein the substrate is
functionalized with groups that attach to the plurality of
different nucleic acids.

36. The array of claim 1, wherein the plurality of different
oligonucleotides have known sequences.

37. The array of claim 15, wherein the plurality of 
different nucleic acids have known sequences. 

38. The array of claim 28, wherein the plurality of 
different nucleic acids have known sequences. 

39. The array of claim 29, wherein the plurality of
different nucleic acids have known sequences.

The Sequence of the Human Genome



A 2.91-billion base pair (bp) consensus sequence of the euchromatic
portion of the human genome was generated by the whole-genome shotgun
sequencing method. The 14.8-billion bp DNA sequence was generated over
9 months from 27,271,853 high-quality sequence reads (5.11-fold
coverage of the genome) from both ends of plasmid clones made from the
DNA of five individuals. Two assembly strategies, a whole-genome
assembly and a regional chromosome assembly, were used, each combining
sequence data from Celera and the publicly funded genome effort. The
public data were shredded into 550-bp segments to create a 2.9-fold
coverage of those genome regions that had been sequenced, without
including biases inherent in the cloning and assembly procedure used
by the publicly funded group. This brought the effective coverage in
the assemblies to eightfold, reducing the number and size of gaps in
the final assembly over what would be obtained with 5.11-fold
coverage. The two assembly strategies yielded very similar results
that largely agree with independent mapping data. The assemblies
effectively cover the euchromatic regions of the human chromosomes.
More than 90% of the genome is in scaffold assemblies of 100,000 bp or
more, and 25% of the genome is in scaffolds of 10 million bp or
larger. Analysis of the genome sequence revealed 26,588
protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with
mouse matches or other weak supporting evidence. Although gene-dense
clusters are obvious, almost half the genes are dispersed in low G+C
sequence separated by large tracts of apparently noncoding sequence.
Only 1.1% of the genome is spanned by exons, whereas 24% is in
introns, with 75% of the genome being intergenic DNA. Duplications of
segmental blocks, ranging in size up to chromosomal lengths, are
abundant throughout the genome and reveal a complex evolutionary
history. Comparative genomic analysis indicates vertebrate expansions
of genes associated with neuronal function, with tissue-specific
developmental regulation, and with the hemostasis and immune systems.
DNA sequence comparisons between the consensus sequence and publicly
funded genome data provided locations of 2.1 million single-nucleotide
polymorphisms (SNPs). A random pair of human haploid genomes differed
at a rate of 1 bp per 1250 on average, but there was marked
heterogeneity in the level of polymorphism across the genome. Less
than 1% of all SNPs resulted in variation in proteins, but the task of
determining which SNPs have functional consequences remains an open
challenge.

Decoding of the DNA that constitutes the human genome has been widely
anticipated for the contribution it will make toward understanding
human evolution, the causation of disease, and the interplay between
the environment and heredity in defining the human condition. A
project with the goal of determining the complete nucleotide sequence
of the human genome was first formally proposed in 1985 (1). In
subsequent years, the idea met with mixed reactions in the scientific
community (2). However, in 1990, the Human Genome Project (HGP) was
officially initiated in the United States under the direction of the
National Institutes of Health and the U.S. Department of Energy with a
15-year, $3 billion plan for completing the genome sequence. In 1998
we announced our intention to build a unique genome-sequencing
facility, to determine the sequence of the human genome over a 3-year
period. Here we report the penultimate milestone along the path toward
that goal, a nearly complete sequence of the euchromatic portion of
the human genome. The sequencing was performed by a whole-genome
random shotgun method with subsequent assembly of the sequenced
segments.

The modern history of DNA sequencing began in 1977, when Sanger
reported his method for determining the order of nucleotides of DNA
using chain-terminating nucleotide analogs (3). In the same year, the
first human gene was isolated and sequenced (4). In 1986, Hood and
co-workers (5) described an improvement in the Sanger sequencing
method that included attaching fluorescent dyes to the nucleotides,
which permitted them to be sequentially read by a computer. The first
automated DNA sequencer, developed by Applied Biosystems in California
in 1987, was shown to be successful when the sequences of two genes
were obtained with this new technology (6). From early sequencing of
human genomic regions (7), it became clear that cDNA sequences (which
are reverse-transcribed from RNA) would be essential to annotate and
validate gene predictions in the human genome. These studies were the
basis in part for the development of the expressed sequence tag (EST)
method of gene identification (8), which is a random selection, very
high throughput sequencing approach to characterize cDNA libraries.
The EST method led to the rapid discovery and mapping of human genes
(9). The increasing numbers of human EST sequences necessitated the
development of new computer algorithms to analyze large amounts of
sequence data, and in 1993 at The Institute for Genomic Research
(TIGR), an algorithm was developed that permitted assembly and
analysis of hundreds of thousands of ESTs. This algorithm permitted
characterization and annotation of human genes on the basis of 30,000
EST assemblies (10).

The complete 49-kbp bacteriophage lamb- 
da genome sequence was determined by a 
shotgun restriction digest method in 1982 
(11). When considering methods for sequencing
the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts established the 
broad applicability of this approach (14, 15). 

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simultaneously
map and sequence the human genome
by means of end sequences from 150-kbp
bacterial artificial chromosomes (BACs)
(17, 18). The end sequences spanned by
known distances provide long-range continuity
across the genome. A modification of the
BAC end-sequencing (BES) method was applied
successfully to complete chromosome 2
from the Arabidopsis thaliana genome (19).
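
The way paired-end sequences supply long-range continuity can be sketched with a toy calculation. The example is hypothetical throughout: the function `estimate_gap`, the coordinates, and the 10,000-bp insert size are assumptions chosen for illustration and are not taken from the paper. The idea is that if one read of a pair lies near the end of one contig and its mate lies near the start of the next, the known clone length bounds the unsequenced gap between them.

```python
from statistics import mean


def estimate_gap(contig_a_len, links, insert_size):
    """Estimate the gap between contig A and a downstream contig B.

    Each link is (start_in_a, end_in_b): one read of the pair starts at
    `start_in_a` on contig A and its mate ends at `end_in_b` on contig B.
    The clone spans `insert_size` bases end to end, so
    gap = insert_size - (bases spanned in A) - (bases spanned in B).
    """
    gaps = [
        insert_size - (contig_a_len - start_a) - end_b
        for start_a, end_b in links
    ]
    return mean(gaps)


# Three hypothetical mate pairs linking a 50,000-bp contig to its neighbor.
links = [(45_000, 3_000), (46_500, 4_400), (48_000, 6_100)]
gap = estimate_gap(50_000, links, insert_size=10_000)
print(f"estimated gap: {gap} bp")
```

Averaging over several pairs reduces the effect of variation in individual clone lengths, which is one reason libraries of prescribed insert size are useful.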

In 1997, Weber and Myers (20) proposed
whole-genome shotgun sequencing of the
human genome. Their proposal was not well
received (21). However, by early 1998, as
less than 5% of the genome had been sequenced,
it was clear that the rate of progress
in human genome sequencing worldwide
was very slow (22), and the prospects for
finishing the genome by the 2005 goal were
uncertain.

In early 1998, PE Biosystems (now Applied
Biosystems) developed an automated, high-throughput
capillary DNA sequencer, subsequently
called the ABI PRISM 3700 DNA
Analyzer. Discussions between PE Biosystems
and TIGR scientists resulted in a plan to undertake
the sequencing of the human genome with
the 3700 DNA Analyzer and the whole-genome
shotgun sequencing techniques developed at
TIGR (23). Many of the principles of operation
of a genome-sequencing facility were established
in the TIGR facility (24). However, the
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic
changes in the public genome effort subsequent
to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach
to the human genome. We initially proposed
to do 10-fold sequence coverage of the
genome over a 3-year period and to make interim
assembled sequence data available quarterly.
The modifications included a plan to perform
random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies
published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly
announcements in the absence of interim
assemblies to report.

Although this strategy provided a reasonable
result very early that was consistent with a
whole-genome shotgun assembly with eightfold
coverage, the human genome sequence is
not as finished as the Drosophila genome was
with an effective 13-fold coverage. However, it
became clear that even with this reduced coverage
strategy, Celera could generate an accurately
ordered and oriented scaffold sequence of
the human genome in less than 1 year. Human
genome sequencing was initiated 8 September
1999 and completed 17 June 2000. The first
assembly was completed 25 June 2000, and the
assembly reported here was completed 1 October
2000. Here we describe the whole-genome
random shotgun sequencing effort applied to
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes
of the Homo sapiens genome. Any GenBank-derived
data were shredded to remove
potential bias to the final sequence from chimeric
clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly
and accurately assembled genome sequence
with faithful order and orientation of contigs
is essential for an accurate analysis of the
human genetic code, we have devoted a considerable
portion of this manuscript to the
documentation of the quality of our reconstruction
of the genome. We also describe our
preliminary analysis of the human genetic
code on the basis of computational methods.
Figure 1 (see fold-out chart associated with
this issue; files for each chromosome can be
found in Web fig. 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview
of the genome and the features encoded
in it. The detailed manual curation and interpretation
of the genome are just beginning.
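
The idea of shredding previously assembled data into short, read-like pieces can be illustrated as follows. This is a hedged sketch, not the procedure used by the authors, which this excerpt does not describe in detail: the function `shred`, its `coverage` parameter, and the tiling rule (fragment starts spaced `frag_len / coverage` bases apart) are assumptions made for illustration. Only the 550-bp fragment length is taken from the text.

```python
def shred(sequence, frag_len=550, coverage=2.0):
    """Cut `sequence` into overlapping fragments of `frag_len` bases.

    Successive fragments start about frag_len / coverage bases apart, so
    each interior base is covered roughly `coverage` times.
    """
    step = max(1, round(frag_len / coverage))
    last_start = max(0, len(sequence) - frag_len)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail is covered
    return [sequence[s:s + frag_len] for s in starts]


demo = "ACGT" * 500  # 2,000-base toy sequence
frags = shred(demo, frag_len=550, coverage=2.0)
achieved = sum(len(f) for f in frags) / len(demo)
print(f"{len(frags)} fragments, about {achieved:.2f}-fold coverage")
```

Because the fragments carry no record of how the source assembly joined them, any misjoin in the source is not inherited as a constraint; the fragments must be placed again on the strength of their own overlaps.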

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 



1 Sources of DNA and Sequencing Methods

2 Genome Assembly Strategy and Characterization

3 Gene Prediction and Annotation

4 Genome Structure

5 Genome Evolution

6 A Genome-Wide Examination of Sequence Variations

7 An Overview of the Predicted Protein-Coding Genes in the Human Genome

8 Conclusions



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale
and ethical rules governing donor selection to
ensure ethnic and gender diversity along with
the methodologies for DNA extraction and library
construction. The plasmid library construction
is the first critical step in shotgun
sequencing. If the DNA libraries are not uniform
in size, nonchimeric, and do not randomly
represent the genome, then the subsequent steps
cannot accurately reconstruct the genome sequence.
We used automated high-throughput
DNA sequencing and the computational infrastructure
to enable efficient tracking of enormous
amounts of sequence information (27.3
million sequence reads; 14.9 billion bp of sequence).
Sequencing and tracking from both
ends of plasmid clones from 2-, 10-, and 50-kbp
libraries were essential to the computational
reconstruction of the genome. Our evidence
indicates that the accurate pairing rate of end
sequences was greater than 98%.
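
The figures quoted in the summary can be related to one another with simple arithmetic. The sketch below uses the rounded values given in the text (27.3 million reads, 14.9 billion bp) together with the 2.91-billion bp consensus length quoted in the abstract; the variable names are ours, and the results are approximate because the inputs are rounded.

```python
reads = 27.3e6      # sequence reads (rounded figure from the summary)
bases = 14.9e9      # total bases sequenced (rounded figure)
genome_bp = 2.91e9  # consensus sequence length quoted in the abstract

coverage = bases / genome_bp   # average times each base was sequenced
mean_read_len = bases / reads  # average usable bases per read

print(f"coverage: about {coverage:.1f}-fold")
print(f"mean read length: about {mean_read_len:.0f} bp")
```

The result is close to the roughly fivefold coverage reported, and the implied read length falls within the 500 to 600 bp range given for paired-end sequences.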

Various policies of the United States and the
World Medical Association, specifically the
Declaration of Helsinki, offer recommendations
for conducting experiments with human
subjects. We convened an Institutional Review
Board (IRB) (31) that helped us establish
the protocol for obtaining and using human
DNA and the informed consent process
used to enroll research volunteers for the
DNA-sequencing studies reported here. We
adopted several steps and procedures to protect
the privacy rights and confidentiality of
the research subjects (donors). These included
a two-stage consent process, a secure random
alphanumeric coding system for specimens
and records, circumscribed contact with
the subjects by researchers, and options for
off-site contact of donors. In addition, Celera
applied for and received a Certificate of Confidentiality
from the Department of Health
and Human Services. This Certificate authorized
Celera to protect the privacy of the
individuals who volunteered to be donors as
provided in Section 301(d) of the Public
Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. Prospective
donors were asked, on a voluntary
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three fe- 
males — one African-American, one Asian- 
Chinese, one Hispanic-Mexican, and two 
Caucasians (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/ 
1304/DC1). The decision of whose DNA to 
sequence was based on a complex mix of fac- 
tors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid librar- 
ies in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored ef- 
fectively (Fig. 2) (34). 

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is
supported by a high-performance computational
facility (36).

The process for DNA sequencing was modular
by design and automated. Intermodule
sample backlogs allowed four principal
modules to operate independently: (i) library
transformation, plating, and colony
picking; (ii) DNA template preparation;
(iii) dideoxy sequencing reaction set-up
and purification; and (iv) sequence
determination with the ABI PRISM 3700 DNA
Analyzer. Because the inputs and outputs
of each module have been carefully
matched and sample backlogs are continuously
managed, sequencing has proceeded
without a single day's interruption since the
initiation of the Drosophila project in May
1999. The ABI 3700 is a fully automated
capillary array sequencer and as such can
be operated with a minimal amount of
hands-on time, currently estimated at about
15 min per day. The capillary system also
facilitates correct associations of sequencing
traces with samples through the elimination
of manual sample loading and lane-tracking
errors associated with slab gels.
About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management
system (LIMS) tracked all sample plates by
unique bar code identifiers. The facility was
supported by a quality control team that
performed raw material and in-process testing
and a quality assurance group with
responsibilities including document control,
validation, and auditing of the facility. Critical to
the success of the scale-up was the validation
of all software and instrumentation before
implementation, and production-scale testing
of any process changes.

1.2 Trace processing 

An automated trace-processing pipeline has
been developed to process each sequence file
(37). After quality and vector trimming, the
average trimmed sequence length was 543
bp, and the sequencing accuracy was
exponentially distributed with a mean of 99.5%
and with less than 1 in 1000 reads being less
than 98% accurate (26). Each trimmed
sequence was screened for matches to
contaminants including sequences of vector alone, E.
coli genomic DNA, and human mitochondrial
DNA. The entire read for any sequence
with a significant match to a contaminant was
discarded. A total of 713 reads matched E.
coli genomic DNA and 2114 reads matched
the human mitochondrial genome.

1.3 Quality assessment and control 

The importance of the base-pair level
accuracy of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the
genome, and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-pair
information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as
sequencing reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data
produced by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1. Celera-generated data input into assembly.

            Number of reads for different insert libraries
Individual        2 kbp       10 kbp       50 kbp        Total   Total no. of bp

No. of sequencing reads
A                     0            0    2,767,357    2,767,357     1,502,674,851
B            11,736,757    7,467,755       66,930   19,271,442    10,464,393,006
C               853,819      881,290            0    1,735,109       942,164,187
D               952,523    1,046,815            0    1,999,338     1,085,640,534
F                     0    1,498,607            0    1,498,607       813,743,601
Total        13,543,099   10,894,467    2,834,287   27,271,853    14,808,616,179

Fold sequence coverage (2.9-Gb genome)
A                     0            0         0.52         0.52
B                  2.20         1.40         0.01         3.61
C                  0.16         0.17            0         0.32
D                  0.18         0.20            0         0.37
F                     0         0.28            0         0.28
Total              2.54         2.04         0.53         5.11

Fold clone coverage
A                     0            0        18.39        18.39
B                  2.96        11.26         0.44        14.67
C                  0.22         1.33            0         1.54
D                  0.24         1.58            0         1.82
F                     0         2.26            0         2.26
Total              3.42        16.43        18.84        38.68

Insert size* (mean)
Average        1,951 bp    10,800 bp    50,715 bp

Insert size* (SD)
Average           6.10%        8.10%       14.90%

% Mates†
Average           74.50        80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the
genome. One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to
computational assembly. Both approaches provided
essentially the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the
genome was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera
assemblies consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3).
Finally, our assemblies did not incorporate all
reads into the final set of reported scaffolds.
This set of unincorporated reads is termed
"chaff," and typically consisted of reads from
within highly repetitive regions, data from other
organisms introduced through various routes as
found in many genome projects, and data of
poor quality or with untrimmed vector.

2.1 Assembly data sets

We used two independent sets of data for our
assemblies. The first was a random shotgun
data set of 27.27 million reads of average length
543 bp produced at Celera. This consisted
largely of mate-pair reads from 16 libraries
constructed from DNA samples taken from five
different donors. Libraries with insert sizes of 2,
10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in
known sequenced stretches of the genome, we
were able to characterize the range of insert
sizes in each library and determine a mean and
standard deviation. Table 1 details the number
of reads, sequencing coverage, and clone
coverage achieved by the data set. The clone
coverage is the coverage of the genome in cloned
DNA, considering the entire insert of each
clone that has sequence from both ends. The
clone coverage provides a measure of the
amount of physical DNA coverage of the
genome. Assuming a genome size of 2.9 Gbp, the
Celera trimmed sequences gave a 5.1X
coverage of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.

The second data set was from the publicly
funded Human Genome Project (PFP) and is
primarily derived from BAC clones (30). The
BAC data input to the assemblies came from a
download of GenBank on 1 September 2000
(Table 2) totaling 4443.3 Mbp of sequence.
The data for each BAC is deposited at one of
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered
assemblies of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has
focused on a product of lower quality and
completeness, but on a faster time-course, by
concentrating on the production of Phase 1 data
from a 3X to 4X light-shotgun of each BAC
clone.
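
As a rough check on the coverage figures quoted in this section, the two definitions can be written out directly. This is a minimal sketch rather than the authors' calculation: it assumes that clone coverage counts one clone per mated pair of reads (reads x % mates / 2) multiplied by the mean insert size, and the function names are illustrative. With the Table 1 values it reproduces the 5.11X sequence coverage and lands within roughly 0.1X of each reported clone coverage.

```python
GENOME_SIZE_BP = 2_900_000_000  # genome size assumed in the text (2.9 Gbp)


def sequence_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: trimmed bases divided by genome size."""
    return total_bases / genome_size


def clone_coverage(num_reads, mate_fraction, mean_insert_bp,
                   genome_size=GENOME_SIZE_BP):
    """Fold clone coverage, assuming one clone per mated pair of reads."""
    mated_clones = num_reads * mate_fraction / 2
    return mated_clones * mean_insert_bp / genome_size


# Read counts, % mates, and mean insert sizes taken from Table 1.
seq_cov = sequence_coverage(14_808_616_179)
clone_cov = {
    "2 kbp": clone_coverage(13_543_099, 0.745, 1_951),
    "10 kbp": clone_coverage(10_894_467, 0.808, 10_800),
    "50 kbp": clone_coverage(2_834_287, 0.756, 50_715),
}
```

The small differences from the published clone coverages presumably reflect rounding of the table entries or details of how mated clones were counted, which the text does not spell out.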

We screened the bactig sequences for con- 
taminants by using the BLAST algorithm 
against three data sets: (i) vector sequences 
in Univec core (38), filtered for a 25-bp 
match at 98% sequence identity at the ends 
of the sequence and a 30-bp match internal 
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank without
primate and human virus entries, filtered at
200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of 
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and in- 
cluded in the data sets for both assembly 
processes (18). 
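
The tip-excision rule above (a vector match of 25 bp or more lying within 50 bp of a contig end) can be sketched as follows. This is an illustration under stated assumptions, not the screening code used for the paper: the function name and the half-open interval representation are hypothetical, and the sketch assumes the matched vector itself is removed along with the tip.

```python
def excise_vector_tips(contig_len, vector_hits, min_match=25, window=50):
    """Return the (start, end) interval of the contig kept after tip excision.

    vector_hits: list of (start, end) half-open intervals of vector matches,
    in contig coordinates.
    """
    keep_start, keep_end = 0, contig_len
    for start, end in vector_hits:
        if end - start < min_match:
            continue  # match too short to count as vector
        if start < window:
            # Hit lies near the left end: drop the tip through the hit.
            keep_start = max(keep_start, end)
        if contig_len - end < window:
            # Hit lies near the right end: drop from the hit to the end.
            keep_end = min(keep_end, start)
    return keep_start, max(keep_start, keep_end)
```

For example, a 30-bp vector match at positions 10 to 40 of a 1000-bp contig leaves the interval (40, 1000), while a match in the middle of the contig leaves it untouched.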

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 22% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional
sequence coverage, but not mate pairs, assembled
bactigs, or genome locality, from some
externally generated data.

Table 2. GenBank data input into assembly.
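
The shredding of bactigs into 550-bp faux reads with a 2X covering can be pictured with a short sketch. The tiling scheme here is an assumption (reads start every half read length, with one extra read flush against the end); the paper does not describe how its shredder placed reads, and the function name is hypothetical.

```python
def shred(sequence, read_len=550, coverage=2):
    """Cut a bactig into overlapping faux reads giving a roughly
    `coverage`-fold tiling of its interior."""
    if len(sequence) <= read_len:
        return [sequence]
    step = read_len // coverage
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the final bases are covered
    return [sequence[s:s + read_len] for s in starts]


# Toy 2000-bp bactig used only to exercise the function.
bactig = "ACGT" * 500
faux_reads = shred(bactig)
```

Because the faux reads carry no assembly or location information, the downstream assembler treats them like any other shotgun read, which is the point of the shredding step.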

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments,
or "components," that could be determined with
confidence, and then shotgun assembly was
applied to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was
reduced and the effect of interchromosomal
duplications was ameliorated. This also resulted in a
reconstruction of the genome that was relatively
independent of the whole-genome assembly
results so that the two assemblies could be
compared for consistency. The quality of the
partitioning into components was crucial so that
different genome regions were not mixed
together. We constructed components from (i) the
longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate
and complete the scaffold for a given sequence
stretch, the more accurately one can tile these
scaffolds into contiguous components on the
basis of sequence overlap and mate-pair
information. We further visually inspected and
curated the scaffold tiling of the components to
further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of
the sequence in each component was obtained
by applying our whole-genome assembly
algorithm to the partitioned, relevant Celera data and
the shredded, faux reads of the partitioned,
relevant bactig data.

2.3 Whole-genome assembly 

The algorithms used for whole-genome
assembly (WGA) of the human genome were
enhancements to those used to produce the
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements, including
Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.

The Overlapper compares every read
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously
vector-trimmed, the Overlapper can insist on
complete overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
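
The overlap criterion (an end-to-end match of at least 40 bp with no more than 6% differences) can be illustrated with a brute-force check on a single pair of reads. This sketch is not the Overlapper itself: it counts substitutions only, ignores insertions and deletions, and tests every candidate length directly, whereas a production overlapper would need indexing and alignment to handle tens of millions of reads. All names are hypothetical.

```python
import random


def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    within the allowed mismatch rate, or 0 if there is none.

    Substitutions only; indels are ignored in this sketch.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0


# Two reads sampled 60 bp apart from a random 200-bp toy sequence.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200))
read_a, read_b = genome[:120], genome[60:180]
overlap_len = end_overlap(read_a, read_b)
```

Here the two reads share 60 bases of the toy sequence, so the check reports a 60-bp overlap; reads with fewer than 40 shared bases report none.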

Every overlap computed above is statistically
a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially
difficult is that while many overlaps
are actually sampled from overlapping
regions of the genome, and thus imply that
the sequence reads should be assembled
together, even more overlaps are actually from
two distinct copies of a low-copy repeated
element not screened above, thus constituting
an error if put together. We call the former
"true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choosing
repeat-induced overlaps, especially early
in the process.

We achieve this objective in the Unitig- 
ger. We first find all assemblies of reads that 
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely
assembled contigs). Formally, these unitigs are
the uncontested interval subgraphs of the 
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6x simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp 
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
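
The text does not give the formula for the discriminator. One form such a log-odds score could take, assuming that read start positions arrive as a Poisson process, is sketched below: if a unique unitig of a given span is expected to contain lambda reads, a two-copy collapse is expected to contain 2 x lambda, and the natural-log ratio of the two Poisson likelihoods for k observed reads simplifies to lambda - k ln 2. The function name and parameters are illustrative, not taken from the paper.

```python
import math


def unique_log_odds(num_reads, unitig_span_bp, total_reads, genome_size_bp):
    """Natural-log odds that a unitig is unique DNA rather than a
    two-copy repeat collapsed into one subassembly.

    Assumes read start positions follow a Poisson process.
    """
    # Number of reads expected in the span if the unitig is unique.
    expected = unitig_span_bp * total_reads / genome_size_bp
    return expected - num_reads * math.log(2)


# A 10-kbp unitig in a 27.27-million-read shotgun of a 2.9-Gbp genome.
typical = unique_log_odds(94, 10_000, 27_271_853, 2_900_000_000)
overcollapsed = unique_log_odds(188, 10_000, 27_271_853, 2_900_000_000)
```

A unitig holding about the expected number of reads scores well above zero, while one holding twice as many scores well below it, which matches the observation that overcollapsed unitigs stand out by their coverage depth.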

The result of running the Unitigger was
thus a set of correctly assembled subcontigs
covering an estimated 73.6% of the human
genome. The Scaffolder then proceeded to
use mate-pair information to link these
together into scaffolds. When there are two or
more mate pairs that imply that a given pair
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one
can with high confidence link together all
U-unitigs that are linked by at least two 2- or
10-kbp mate pairs producing intermediate-sized
scaffolds that are then recursively
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process
yielded scaffolds that are on the order of
megabase pairs in size with gaps between
their contigs that generally correspond to
repetitive elements and occasionally to small
sequencing gaps. These scaffolds reconstruct
the majority of the unique sequence within a
genome.
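
How a bundle of mate pairs fixes the distance between two contigs can be shown with a little arithmetic. This is a sketch under simplifying assumptions, not the Scaffolder: both contigs are taken to be in the same orientation, every mate pair is assumed to come from a library with the same mean insert size, and the coordinate convention and function name are hypothetical.

```python
from statistics import mean


def estimate_gap(len_a, mate_links, insert_mean, min_links=2):
    """Estimate the gap between contig A and a following contig B.

    mate_links: list of (pos_in_a, pos_in_b), where pos_in_a is the 5' end
    of the forward read in A and pos_in_b is the 5' end of its
    reverse-strand mate in B, each measured from the start of its contig.
    Returns None when fewer than `min_links` mate pairs support the join.
    """
    if len(mate_links) < min_links:
        return None
    # Insert length = (bases of A after the read start) + gap + pos_in_b.
    gaps = [insert_mean - (len_a - a) - b for a, b in mate_links]
    return mean(gaps)


# Two 2-kbp mate pairs spanning the join after a 5000-bp contig.
gap = estimate_gap(5000, [(4000, 500), (4500, 1000)], insert_mean=2000)
```

Requiring at least two agreeing mate pairs before accepting a join mirrors the rule described above; a single link returns no estimate.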

For the Drosophila assembly, we engaged
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we continued
to use the first "Rocks" substage where
all unitigs with a good, but not definitive,
discriminator score are placed in a scaffold
gap. This was done with the condition that
two or more mate pairs with one of their
reads already in the scaffold unambiguously
place the unitig in the given gap. We estimate
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage
of the human assembly, making it more like
the mechanism suggested in our earlier work
(43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being
in a contig of the scaffold and implying R's
placement is collected. Celera's mate-pairing
information is correct more than 99% of the
time. Thus, almost every, but not all, of the
reads in the set belong in the gap, and when
a read does not belong it rarely agrees with
the remainder of the reads. Therefore, we
simply assemble this set of reads within the
gap, eliminating any reads that conflict with
the assembly. This operation proved much
more reliable than the one it replaced for the
Drosophila assembly; in the assembly of a
simulated shotgun data set of human chromosome
22, all stones were placed correctly.

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation
process performing the function indicated by its label, with the labels on arcs between ovals
describing the nature of the objects produced and/or consumed by a process. This figure
summarizes the discussion in the text that defines the terms and phrases used.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover 
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long inter- 
spersed elements whose quality was only 
99.62% correct. We decided that for the
human genome it was philosophically better not
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process, 
and also at several intermediate points, a 
consensus sequence of every contig is pro- 
duced. Our algorithm is driven by the princi- 
ple of maximum parsimony, with quality- 
value-weighted measures for evaluating each 
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data whenever
it is present. In the event that no Celera
data cover a given region, the BAC data
sequence is used.
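
A much simplified stand-in for quality-value-weighted consensus calling is sketched below. It is not the maximum-parsimony, Bayesian procedure described above: it simply sums the quality values supporting each base in an alignment column and reports the base with the largest total, and it does not model the preference for Celera data over BAC data. The data layout and function names are assumptions made for illustration.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the highest summed quality value in one column.

    column: list of (base, quality_value) pairs, one per read covering
    the position.
    """
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    return max(weight, key=weight.get)


def consensus(columns):
    """Call a consensus string from a list of alignment columns."""
    return "".join(consensus_base(column) for column in columns)


# Two columns: two modest-quality A calls outweigh one C call,
# and a high-quality T outweighs a low-quality G.
called = consensus([
    [("A", 30), ("A", 20), ("C", 40)],
    [("G", 10), ("T", 35)],
])
```

Weighting by quality lets several moderately confident reads outvote a single confident one, which is the intuition behind evaluating each base with quality-weighted measures.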

A key element of achieving a WGA of the
human genome was to parallelize the Overlapper
and the central consensus sequence-constructing
subroutines. In addition, memory was
a real issue: a straightforward application of
the software we had built for Drosophila would
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same
computation with a maximum of instantaneous
usage of 28 gigabytes of RAM. Moreover, the
incremental nature of the first three stages
allowed us to continually update the state of this
part of the computation as data were delivered
and then perform a 7-day run to complete
Scaffolding and Repeat Resolution whenever
desired. For our assembly operations, the total
compute infrastructure consists of 10 four-processor
SMPs with 4 gigabytes of memory per
cluster (Compaq's ES40, Regatta) and a
16-processor NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The
total compute for a run of the assembler was
roughly 20,000 CPU hours.

The assembly of Celera's data, together
with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and
consisting of 2.586 Gbp of sequence. The chaff, or
set of reads not incorporated in the assembly,
numbered 11.27 million (26%), which is
consistent with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the
distribution of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long. Similarly,
more than 65% of the sequence is in contigs
>30 kbp, more than 31% is in contigs >100
kbp, and the largest contig was 1.22 Mbp long.
Table 3 gives detailed summary statistics for
the structure of this assembly with a direct
comparison to the compartmentalized shotgun
assembly.
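
Statistics of the form "more than 65% of the sequence is in contigs >30 kbp" are length-weighted fractions, which a few lines make explicit. This is a generic illustration with made-up contig lengths, not the assembly's data; the function name is hypothetical.

```python
def fraction_in_long(lengths, threshold_bp):
    """Fraction of all bases that lie in pieces longer than `threshold_bp`."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n > threshold_bp) / total


# Toy contig lengths in bp, chosen only to exercise the function.
toy_contigs = [10_000, 40_000, 150_000]
share_over_30k = fraction_in_long(toy_contigs, 30_000)
share_over_100k = fraction_in_long(toy_contigs, 100_000)
```

Because the fraction is weighted by length, a handful of long contigs can account for most of the sequence even when short contigs are far more numerous.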

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pursued
a localized assembly approach that was
intended to subdivide the genome into segments,
each of which could be shotgun assembled
individually. We expected that this
would help in resolution of large
interchromosomal duplications and improve the
statistics for calculating U-unitigs. The
compartmentalized assembly process involved
clustering Celera reads and bactigs into large,
multiple megabase regions of the genome,
and then running the WGA assembler on the
Celera data and shredded, faux reads
obtained from the bactig data.

The first phase of the CSA strategy was to
separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC
entry, and those that did not match any public
data. Such matches must be guaranteed to
properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not
have any matches, were nonetheless identified
as belonging in the region of the bactig's
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold
sequences in the next step. In outline, the
combining assembler first examines the set of
matching Celera reads to determine if there
are excessive pileups indicative of un-
screened repetitive elements. Wherever these
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions 
are removed. Then all sets of mate pairs that 
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
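
The bundling and greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the combining assembler itself: the function names, the tuple layout of a mate link, and the use of a union-find structure are hypothetical, and the check that a bundle is consistent with the majority of links between contigs of a scaffold is deliberately left out.

```python
from collections import defaultdict


def bundle_mate_pairs(mate_links):
    """Group mate pairs implying the same relative placement of two bactigs.

    mate_links: iterable of (bactig_a, bactig_b, placement) tuples, one per
    mate pair. `placement` is a hypothetical label for the implied relative
    position. Returns {(a, b, placement): weight}, weight = number of mates.
    """
    bundles = defaultdict(int)
    for a, b, placement in mate_links:
        key = (min(a, b), max(a, b), placement)
        bundles[key] += 1
    return dict(bundles)


def greedy_scaffold(bactigs, bundles):
    """Join bactigs into scaffolds, taking bundles in order of weight.

    Assumption: a bundle is accepted whenever it ties together two different
    formative scaffolds; the majority-consistency test is omitted here.
    """
    parent = {b: b for b in bactigs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = []
    ranked = sorted(bundles.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b, placement), weight in ranked:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            accepted.append((a, b, placement, weight))

    groups = defaultdict(list)
    for b in bactigs:
        groups[find(b)].append(b)
    return sorted(sorted(g) for g in groups.values()), accepted
```

Sorting by descending weight means the best-supported links shape the scaffolds first, so a weakly supported link can only attach to a structure that stronger evidence has already built.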

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase 
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler
was also applied to the Phase 3 BACs for
SNP identification, confirmation of assem-
bly, and localization of the Celera reads. The
phase 0 data suggest that a combined whole-
genome shotgun data set and 1X light-shot-
gun of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not
matching the GenBank data were assembled
with our whole-genome assembler. The as-
sembly resulted in a set of scaffolds totaling
442 Mbp in span and consisting of 326 Mbp
of sequence. More than 20% of the scaffolds
were >5 kbp long, and these averaged 63%
sequence and 27% gaps with a total of 302
Mbp of sequence. All scaffolds >5 kbp were
forwarded, along with all scaffolds produced
by the combining assembler, to the subse-
quent tiling phase.

At this stage, we typically had one or two
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a
collection of disjoint Celera-unique scaffolds.
The next step in developing the genome com-
ponents was to determine the order and over-
lap tiling of these BAC and Celera-unique
scaffolds across the genome. For this, we
used Celera's 50-kbp mate-pairs information,
and BAC-end pairs (18) and sequence tagged
site (STS) markers (44) to provide long-
range guidance and chromosome separation.
Given the relatively manageable number of
scaffolds, we chose not to produce this tiling
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry.
Chimeric or contaminating sequence (from
another part of the genome) would not be
incorporated into the reassembly of the com-
ponent because it did not belong there. In
effect, the previous steps in the CSA process
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 
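
Shredding a bactig into a synthetic 2X shotgun data set can be illustrated with the short sketch below. It is only an illustration: the function name, the evenly spaced read starts, and the default read length of 550 bp are assumptions made for the example and are not taken from the text above.

```python
def shred(sequence, read_len=550, coverage=2.0):
    """Cut a sequence into evenly spaced, overlapping faux reads.

    read_len and the even spacing are illustrative assumptions; coverage is
    the target redundancy (2X in the text). Returns a list of substrings.
    """
    if len(sequence) <= read_len:
        return [sequence]
    # A step of read_len / coverage gives roughly `coverage` reads per base.
    step = max(1, int(read_len / coverage))
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the sequence end is covered
    return [sequence[s:s + read_len] for s in starts]
```

Because the faux reads carry no record of how the bactig was originally joined, an assembler given these reads is free to place them differently from the source entry, which is the property the paragraph above relies on.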

WGA assembly of the components result-
ed in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of se-
quence. The chaff, or set of reads not incor-
porated into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100
kbp. The average scaffold size was 1.4 Mbp,
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As 
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than
49% of all gaps were <500 bp long, more 
than 62% of all gaps were < 1 kbp, and all 
gaps are < 100 kbp long. Similarly, more than 
73% of the sequence is in contigs > 30 kbp, 
more than 49% is in contigs >100 kbp, and 
the largest contig was 1.99 Mbp long. Table 3 
provides summary statistics for the structure
of this assembly with a direct comparison to 
the WGA assembly. 

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
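
The two selection rules in this paragraph (which scaffold pairs to compare, and which matches to tabulate) can be written down compactly. The sketch below is an assumption-laden illustration: the function names are hypothetical, fragments are represented as sets of identifiers, and a match is reduced to its length and mismatch count.

```python
def should_compare(frags_a, frags_b, min_shared=20, min_fraction=0.20):
    """Decide whether two scaffolds share enough fragments to be compared.

    frags_a, frags_b: sets of fragment identifiers (Celera reads or bactig
    shreds). True if they share at least `min_shared` fragments, or at least
    `min_fraction` of the fragments of the smaller scaffold.
    """
    shared = len(frags_a & frags_b)
    smaller = min(len(frags_a), len(frags_b))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)


def is_tabulated_match(length_bp, mismatches, min_len=200, max_mismatch=0.02):
    """True for matches of at least 200 bp with at most 2% mismatch."""
    return length_bp >= min_len and mismatches <= max_mismatch * length_bp
```

The fractional rule matters for small scaffolds: a 60-fragment scaffold can never share 20 fragments with a partner it overlaps only partly, but it can still qualify through the 20% criterion.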

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA
(3.95%) was not covered by the CSA, where-
as 204.5 Mbp (8.26%) of the CSA was not
covered by the WGA. This estimate did not
require any consistency of the assemblies or
any uniqueness of the matching segments.
Thus, another analysis was conducted in
which matches of less than 1 kbp between a
pair of scaffolds were excluded unless they
were confirmed by other matches having a
consistent order and orientation. This gives
some measure of consistent coverage: 1.982
Gbp (95.00%) of the WGA is covered by the
CSA, and 2.169 Gbp (87.69%) of the CSA is
covered by the WGA by this more stringent
measure.

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the
full length of the overlap implied by the
matching segments. An initial set of candi-
dates was identified automatically, and then
each candidate was inspected by hand. From
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being fur- 
ther evaluated to determine which assembly 
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following
results exclude cases in which one contig in 
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely > 1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp 
(0.11%) in the WGA assembly were incon- 
sistent with the CSA assembly.

The CSA assembly was a few percentage
points better in terms of coverage and slightly
more consistent than the WGA, because it
was in effect performing a few thousand shot-
gun assemblies of megabase-sized problems,
whereas the WGA is performing a shotgun
assembly of a gigabase-sized problem. When
one considers the increase of two-and-a-half
orders of magnitude in problem size, the in-
formation loss between the two is remarkably
small. Because CSA was logistically easier to
deliver and the better of the two results avail-
able at the time when downstream analyses
needed to be begun, all subsequent analysis
was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to
order and orient the scaffolds on the chromo-
somes. We first grouped scaffolds together on
the basis of their order in the components from
CSA. These grouped scaffolds were reordered
by examining residual mate-pairing data be-
tween the scaffolds. We next mapped the scaf-
fold groups onto the chromosome using physi-
cal mapping data. This step depends on having
reliable high-resolution map information such
that each scaffold will overlap multiple mark-
ers. There are two genome-wide types of map
information available: high-density STS maps
and fingerprint maps of BAC clones developed
at Washington University (45). Among the ge-
nome-wide STS maps, GeneMap99 (GM99)
has the most markers and therefore was most
useful for mapping scaffolds. The two different
mapping approaches are complementary to one
another. The fingerprint maps should have bet-
ter local order because they were built by com-
parison of overlapping BAC clones. On the
other hand, GM99 should have a more reliable
long-range order, because the framework mark-
ers were derived from well-validated genetic
maps. Both types of maps were used as a
reference for human curation of the compo-
nents that were the input to the regional assem-
bly, but they did not determine the order of
sequences produced by the assembler.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.



In order to determine the effectiveness of
the fingerprint maps and GM99 for mapping
scaffolds, we first examined the reliability of
these maps by comparison with large scaf-
folds. Only 1% of the STS markers on the 10
largest scaffolds (those >9 Mbp) were
mapped on a different chromosome on
GM99. Two percent of the STS markers dis-
agreed in position by more than five frame-
work bins. However, for the fingerprint
maps, a 2% chromosome discrepancy was
observed, and on average 23.8% of BAC
locations in the scaffold sequence disagreed
with fingerprint map placement by more than
five BACs. When further examining the
source of discrepancy, it was found that most
of the discrepancy came from 4 of the 10
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low discrep-
ancy rate to GM99, and thus we concluded
that the fingerprint map global order in these
cases was not reliable. Smaller scaffolds had
a higher discordance rate with GM99 (4.21%
of STSs were discordant by more than five
framework bins), but a lower discordance rate
with the fingerprint maps (11% of BACs
disagreed with fingerprint maps by more than
five BACs). This observation agrees with the
clone coverage analysis (46) that Celera scaf-
fold construction was better supported by
long-range mate pairs in larger scaffolds than
in small scaffolds.

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the
WashU BAC map, we had a high degree of
confidence that that order was correct; these
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge-
nome in anchored scaffolds, more than 99%
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could 
be ordered relative to the anchored scaffolds
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of 
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
nome was ordered unambiguously. 

Next, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bound- 
ed" between them. For example, small scaf- 
folds having STS hits from the same Gene- 
Map bin or hitting the same BAC cannot be 
ordered relative to each other, but can be
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above ap- 
proaches, ~98% of the genome was an- 
chored, ordered, or bounded. 

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the number of mapped scaffolds,
we arrived at an estimate of the interscaffold
gap of 1483 bp. This gap was used to separate
all the scaffolds on each chromosome and to
assign an offset in the chromosome.
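
The gap estimate and offset assignment amount to a division followed by a running sum. A minimal sketch, assuming hypothetical function names and using made-up scaffold lengths chosen only so that the example lands on the 1483-bp figure quoted above:

```python
def interscaffold_gap(unmapped_lengths, n_mapped):
    """Spread the total unmapped length evenly over the mapped scaffolds."""
    return sum(unmapped_lengths) // n_mapped


def assign_offsets(scaffold_lengths, gap):
    """Return the start offset of each scaffold along one chromosome.

    scaffold_lengths: lengths of the scaffolds in their mapped order.
    Each scaffold is followed by a fixed gap of `gap` bp.
    """
    offsets, position = [], 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Illustrative numbers only (not data from the assembly).
gap = interscaffold_gap([1000, 2000, 1449], n_mapped=3)
offsets = assign_offsets([5000, 3000, 2000], gap)
```

A uniform gap is the simplest choice consistent with the stated assumption that unmapped sequence is distributed evenly across the genome.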

During the scaffold-mapping effort, we en-
countered many problems that resulted in addi-
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be
known with absolute certainty until the eu-
chromatin sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49);
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this se- 
quence served as input to the assembler, the 
finished sequence was shredded into a shot-
gun data set so that the assembler had the
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must 
be able to resolve repetitive elements at the 
scale of components (generally multimega-
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
flexibility to assemble "finished" sequence
differently on the basis of Celera data
resulted in an assembly with
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 

A more global way of assessing complete-
ness is to measure the content of an independent
set of sequence data in the assembly. We com-
pared 48,938 STS markers from GeneMap99
(51) to the scaffolds. Because these markers
were not used in the assembly processes, they
provided a truly independent measure of com-
pleteness. ePCR (53) and BLAST (54) were
used to locate STSs on the assembled genome.
We found 44,524 (91%) of the STSs in the
mapped genome. An additional 2648 markers
(5.4%) were found by searching the unas-
sembled data or "chaff". We identified 1283
STS markers (2.6%) not found in either Celera
sequence or BAC data as of September 2000,
raising the possibility that these markers may
not be of human origin. If that were the case,
the Celera assembled sequence would represent
93.4% of the human genome and the unas-
sembled data 5.5%, for a total of 98.9% cover-
age. Similarly, we compared CSA against
36,678 TNG radiation hybrid markers (55a)
using the same method. We found that 32,371
markers (88%) were located in the mapped
CSA scaffolds, with 2055 markers (5.6%)
found in the remainder. This gave a 94% cov-
erage of the genome through another genome-
wide survey.
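
The directly stated marker percentages follow from simple ratios, which the sketch below recomputes from the counts given in the paragraph. The helper name `percent` is hypothetical, and only the percentages quoted against the full marker sets are reproduced; the adjusted 93.4% and 5.5% figures depend on details not spelled out in the text and are not recomputed here.

```python
def percent(part, whole, ndigits=1):
    """Percentage of `whole` represented by `part`, rounded to ndigits."""
    return round(100.0 * part / whole, ndigits)


GM99_MARKERS = 48_938   # STS markers compared to the scaffolds
TNG_MARKERS = 36_678    # TNG radiation hybrid markers

gm99_mapped = percent(44_524, GM99_MARKERS, 0)       # markers in mapped genome
gm99_chaff = percent(2_648, GM99_MARKERS)            # markers found only in chaff
gm99_missing = percent(1_283, GM99_MARKERS)          # markers found nowhere
tng_mapped = percent(32_371, TNG_MARKERS, 0)         # markers in mapped scaffolds
tng_remainder = percent(2_055, TNG_MARKERS)          # markers in the remainder
tng_total = percent(32_371 + 2_055, TNG_MARKERS, 0)  # combined coverage
```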

Correctness. Correctness is defined as the
structural and sequence accuracy of the as-
sembly. Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the as-
sembly against other finished sequence for
determining sequencing accuracy at the nu-
cleotide level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaf-
folds were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had or-
der conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Un-
mapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

The structural consistency of the assembly
can be measured by mate-pair analysis. In a
correct assembly, every mated pair of se-
quencing reads should be located on the con-
sensus sequence with the correct separation
and orientation between the pairs. A pair is
termed "valid" when the reads are in the
correct orientation and the distance between
them is within the mean ± 3 standard devi-
ations of the distribution of insert sizes of the
library from which the pair was sampled. A
pair is termed "misoriented" when the reads
are not correctly oriented, and is termed "mis-
separated" when the distance between the
reads is not in the correct range but the reads
are correctly oriented. The mean and stan-
dard deviation of each library used by the
assembler were determined as described
above. To validate these, we examined all
reads mapped to the finished sequence of
chromosome 21 (48) and determined how
many incorrect mate pairs there were as a
result of laboratory tracking errors and chi-
merism (two different segments of the ge-
nome cloned into the same plasmid), and how
tight the distribution of insert sizes was for
those that were correct (Table 5). The stan-
dard deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries con-
tained less than 2% invalid mate pairs, where-
as the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair infor-
mation was not perfect, its accuracy was such
that measuring valid, misoriented, and mis-
separated pairs with respect to a given assem-
bly was deemed to be a reliable instrument
for validation purposes, especially when sev-
eral mate pairs confirm or deny an ordering.
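
The three mate-pair categories defined above translate directly into a small classifier. This is a sketch under stated assumptions: the function name and argument layout are hypothetical, and orientation is assumed to have been determined already and passed in as a boolean.

```python
def classify_mate_pair(distance, correctly_oriented, lib_mean, lib_sd, n_sd=3.0):
    """Classify one mate pair as 'valid', 'misoriented', or 'misseparated'.

    distance: observed separation of the two reads on the consensus (bp).
    correctly_oriented: True if the reads have the expected orientation.
    lib_mean, lib_sd: mean and standard deviation of the library insert size.
    """
    if not correctly_oriented:
        return "misoriented"
    if abs(distance - lib_mean) <= n_sd * lib_sd:
        return "valid"
    return "misseparated"
```

Orientation is tested first, so a pair that is both wrongly oriented and wrongly spaced counts as misoriented, matching the definition that misseparated pairs are correctly oriented.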

The clone coverage of the genome was
39X, meaning that any given base pair was,
on average, contained in 39 clones or, equiv-
alently, spanned by 39 mate-paired reads.
Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would
indicate potential assembly problems. We
computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.
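
Per-base clone coverage by valid mate pairs can be computed with a difference array, as in the sketch below. The function names and the half-open `(start, end)` representation of the span of a valid mate pair are assumptions made for illustration.

```python
def clone_coverage(length, valid_pairs):
    """Number of valid mate pairs spanning each base of a scaffold.

    length: scaffold length in bp.
    valid_pairs: iterable of (start, end) spans, half-open [start, end),
    one per valid mate pair (outer coordinates of the clone).
    """
    diff = [0] * (length + 1)
    for start, end in valid_pairs:
        diff[start] += 1   # coverage rises where a clone begins
        diff[end] -= 1     # and falls just past where it ends
    coverage, running = [], 0
    for i in range(length):
        running += diff[i]
        coverage.append(running)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases whose clone coverage is below `threshold`."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)
```

The difference array keeps the cost proportional to the scaffold length plus the number of pairs, rather than to the summed clone lengths, which matters at 39X clone coverage.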

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to
the PFP assembly. To avoid mapping errors
due to high-fidelity repeats, the only pairs
mapped were those for which both reads
matched at only one location with less than
6% differences. A threshold was set such that
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint,
where the construction of the two assemblies
differed. The graphic comparison of the CSA
chromosome 21 assembly with the published
sequence (Fig. 6A) serves as a validation of
this methodology. Blue tick marks in the
panels indicate breakpoints. There were a
similar (small) number of breakpoints on
both chromosome sequences. The exception
was 12 sets of scaffolds in the Celera assem-
bly (a total of 3% of the chromosome length
in 212 single-contig scaffolds) that were
mapped to the wrong positions because they
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two
assemblies. There was a higher percentage of
misoriented and misseparated mate pairs in
the large-insert libraries (50 kbp and BAC
ends) than in the small-insert libraries in both
assemblies (Table 6). The large-insert librar-
ies are more likely to identify discrepancies
simply because they span a larger segment of
the genome. The graphic comparison be-
tween the two assemblies for chromosome 8
(Fig. 6, B and C) shows that there are many
more breakpoints for the PFP assembly than
for the Celera assembly. Figure 7 shows the
breakpoint map (blue tick marks) for both
assemblies of each chromosome in a side-by-
side fashion. The order and orientation of
Celera's assembly shows substantially fewer
breakpoints except on the two finished chro-
mosomes. Figure 7 also depicts large gaps
(>10 kbp) in both assemblies as red tick
marks. In the CSA assembly, the sizes of all
gaps have been estimated on the basis of the
mate-pair data. Breakpoints can be caused by
structural polymorphisms, because the two
assemblies were derived from different hu-
man genomes. They also reflect the unfin-
ished nature of both genome assemblies.

Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orienta-
tion or placement, they were considered invalid (number of invalid mate
pairs).

3 Gene Prediction and Annotation 

Summary. To enumerate the gene inventory, 
we developed an integrated, evidence-based 
approach named Otto. The evidence used to 
increase the likelihood of identifying genes 
includes regions conserved between the 
mouse and human genomes, similarity to 
ESTs or other mRNA-derived data, or simi- 
larity to other proteins. A comparison of Otto 
(combined Otto-RefSeq and Otto homology) 
with Genscan, a standard gene-prediction al- 
gorithm, showed greater sensitivity (0.78 ver- 
sus 0.50) and specificity (0.93 versus 0.63) of 
Otto in the ability to define gene structure. 
Otto-predicted genes were complemented
with a set of genes from three gene-prediction 
programs that exhibited weaker, but still sig- 
nificant, evidence that they may be ex- 
pressed. Conservative criteria, requiring at 
least two lines of evidence, were used to 
define a set of 26,383 genes with good con- 
fidence that were used for more detailed anal- 
ysis presented in the subsequent sections. 
Extensive manual curation to establish pre- 
cise characterization of gene structure will be 
necessary to improve the results from this 
initial computational approach. 

3.1 Automated gene annotation 

A gene is a locus of cotranscribed exons. A 
single gene may give rise to multiple tran- 
scripts, and thus multiple distinct proteins 
with multiple functions, by means of alterna-
tive splicing and alternative transcription ini-
tiation and termination sites. Our cells are
able to discern within the billions of base
pairs of the genomic DNA the signals for
initiating transcription and for splicing to-
gether exons separated by a few or hundreds
of thousands of base pairs. The first step in
characterizing the genome is to define the
structure of each gene and each transcription 
unit. 

The number of protein-coding genes in 
mammals has been controversial from the 
outset. Initial estimates based on reassociation
data placed it between 30,000 and 40,000,
whereas later estimates from the brain were
>100,000 (56). More recent data from both
the corporate and public sectors, based on
extrapolations from EST, CpG island, and
transcript density data, have
not reduced this variance. The highest recent
number of 142,634 genes emanates from a
report from Incyte Pharmaceuticals, and is
based on a combination of EST data and the
association of ESTs with CpG islands (57).
In stark contrast are three quite different, and
much lower, estimates: one of ~35,000 genes
derived with genome-wide EST data and
sampling procedures in conjunction with
chromosome 22 data (58); another of 28,000
to 34,000 genes derived with a comparative
methodology involving sequence conservation
between humans and the puffer fish
Tetraodon nigroviridis (59); and a figure of
35,000 genes, which was derived simply by 
extrapolating from the density of 770 known 
and predicted genes in the 67 Mbp of chro- 
mosomes 21 and 22, to the approximately 
3-Gbp euchromatic genome. 
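
The last of these estimates is a simple density extrapolation, and the arithmetic can be checked directly. The snippet below is only an illustrative sketch: the variable names are ours, and "approximately 3 Gbp" is taken as exactly 3000 Mbp.

```python
# Back-of-the-envelope check of the density-based gene count estimate.
# All three input numbers come from the text; the variable names are illustrative.
genes_on_chr21_and_22 = 770       # known and predicted genes
surveyed_mbp = 67                 # combined size of chromosomes 21 and 22, in Mbp
euchromatic_genome_mbp = 3000     # "approximately 3 Gbp", assumed to be exactly 3000 Mbp

genes_per_mbp = genes_on_chr21_and_22 / surveyed_mbp
estimated_genes = genes_per_mbp * euchromatic_genome_mbp

print(f"{genes_per_mbp:.2f} genes per Mbp")
print(f"roughly {estimated_genes:,.0f} genes genome-wide")
```

The result falls just under the quoted figure of 35,000, which is consistent with the rounding implied in the text.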

The problem of computational identifica- 
tion of transcriptional units in genomic DNA 
sequence can be divided into two phases. The 
first is to partition the sequence into segments 
that are likely to correspond to individual 
genes. This is not trivial and is a weakness of 
most de novo gene-finding algorithms. It is 
also critical to determining the number of 
genes in the human gene inventory. The sec- 
ond challenge is to construct a gene model 
that reflects the probable structure of the 
transcript(s) encoded in the region. This can
be done with reasonable accuracy when a
full-length cDNA has been sequenced or a
highly homologous protein sequence is
known. De novo gene prediction, although
less accurate, is the only way to find genes
that are not represented by homologous proteins
or ESTs. The following section describes
the methods we have developed to
address these problems for the prediction of
protein-coding genes.

Table 6. Genome-wide mate pair analysis of compartmentalized shotgun (CSA) and PFP assemblies.*

                  CSA                                  PFP
Genome    %       % mis-     % mis-        %       % mis-     % mis-
library   valid   oriented   separated†    valid   oriented   separated†
2 kbp     98.5    0.6        1.0           95.7    2.0        2.3
10 kbp    96.7    1.0        2.3           81.9    9.6        8.6
50 kbp    93.9    4.5        1.5           64.2    22.3       13.5
BES       94.1    2.1        3.8           62.0    19.3       18.8
Mean      97.4    1.0        1.6           87.3    6.8        5.9

*Data for individual chromosomes can be found in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/
full/291/5507/1304/DC1. †Mates are misseparated if their distance is >3 SD from the mean library size.

We have developed a rule-based expert sys- 
tem, called Otto, to identify and characterize 
genes in the human genome (60). Otto attempts
to simulate in software the process that a human 
annotator uses to identify a gene and refine its 
structure. In the process of annotating a region 
of the genome, a human curator examines the 
evidence provided by the computational pipe- 
line (described below) and examines how var- 
ious types of evidence relate to one another. A 
curator puts different levels of confidence in 
different types of evidence and looks for 
certain patterns of evidence to support gene 
annotation. For example, a curator may examine
homology to a number of ESTs and
evaluate whether or not they can be connected
into a longer, virtual mRNA. The curator
would also evaluate the strength of the similarity
and the contiguity of the match, in
essence asking whether any ESTs cross
splice junctions and whether the edges of
putative exons have consensus splice sites.
This kind of manual annotation process was
used to annotate the Drosophila genome.
The Otto system can promote observed 
evidence to a gene annotation in one of two 
ways. First, if the evidence includes a high- 
quality match to the sequence of a known 
gene [here defined as a human gene represented
in a curated subset of the RefSeq
database (61)], then Otto can promote this to
a gene annotation. In the second method, Otto 
evaluates a broad spectrum of evidence and 
determines if this evidence is adequate to 
support promotion to a gene annotation. 
These processes are described below. 

Initially, gene boundaries are predicted on 
the basis of examination of sets of overlap- 
ping protein and EST matches generated by a 
computational pipeline (62). This pipeline 
searches the scaffold sequences against pro- 
tein, EST, and genome-sequence databases to 
define regions of sequence similarity and
runs three de novo gene-prediction programs.
To identify likely gene boundaries, regions
of the genome were partitioned by Otto
on the basis of sequence matches identified
by BLAST. Each of the database sequences
matched in the region under analysis was 
compared by an algorithm that takes into 
account both coordinates of the matching se- 
quence, as well as the sequence type (e.g., 
protein, EST, and so forth). The results were 
used to group the matches into bins of related 
sequences that may define a gene and identify
gene boundaries. During this process, multiple
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins," each of which was be- 
lieved to contain a single gene. One weakness of 
this initial implementation of the algorithm was 
in predicting gene boundaries in regions of tandemly
duplicated genes. Gene clusters frequently
resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models.
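
The coverage-tracking step described above, in which overlapping hits are collapsed into the union of the regions they match, can be sketched as a standard interval merge. This is a minimal illustration and not Otto's actual algorithm, which also weighs the sequence type of each match. The function name `merge_hits`, the `max_gap` parameter, and the half-open coordinate convention are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in scaffold coordinates, half-open (assumed)


def merge_hits(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Collapse overlapping hits into the union of the regions they cover.

    Hits separated by at most `max_gap` bases are also joined; `max_gap` is a
    hypothetical knob and is not described in the text.
    """
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Five made-up EST hits on one scaffold; they collapse into two covered regions.
est_hits = [(100, 250), (200, 400), (390, 500), (2000, 2300), (2250, 2600)]
covered = merge_hits(est_hits)
print(covered)
```

A purely positional merge like this also shows why tandemly duplicated genes are a problem: homologous neighbors produce hits that chain together, so adjacent genes can end up in a single bin.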

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge- 
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curat- 
ed human gene set RefSeq from the Nation- 
al Center for Biotechnology Information 
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
Sim4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation.
Because the genome sequence has gaps and
sequence errors such as frameshifts, it was not
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 
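
The promotion rule for known genes reduces to two thresholds, which can be written as a short predicate. This is a sketch under stated assumptions: the function and argument names are illustrative, identity is passed as a fraction rather than a percentage, and how Otto measures matched length is not specified in the text.

```python
def promote_refseq_match(matched_len: int, transcript_len: int, identity: float) -> bool:
    """Return True if a RefSeq match meets the stated promotion thresholds:
    at least 50% of the transcript length matched, at more than 92% identity.

    `identity` is a fraction in [0, 1]; the argument names are illustrative.
    """
    if transcript_len <= 0:
        raise ValueError("transcript_len must be positive")
    coverage = matched_len / transcript_len
    return coverage >= 0.50 and identity > 0.92


print(promote_refseq_match(1200, 2000, 0.97))  # 60% coverage, 97% identity
print(promote_refseq_match(800, 2000, 0.99))   # only 40% coverage
print(promote_refseq_match(1500, 2000, 0.92))  # identity not above 92%
```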

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8,
and (C) a 1-Mb region of chromosome 8 representing
a single Celera scaffold. To generate the figure, Celera
fragment sequences were mapped onto each assembly.
The PFP assembly is indicated in the upper third
of each panel; the Celera assembly is indicated in the
lower third. In the center of the panel, green lines
show Celera sequences that are in the same order and
orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines
indicate sequence blocks that are in the same orientation,
but out of order. Red lines indicate sequence
blocks that are not in the same orientation. For
clarity, in the latter two cases, lines are only drawn
between segments of matching sequence that are at
least 50 kbp long. The top and bottom thirds of each
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as
expected from the mean library insert size, are omitted
from the figure for clarity.) Predicted breakpoints,
corresponding to stacks of violated mate pairs of the
same type, are shown as blue ticks on each assembly
axis. Runs of more than 10,000 Ns are shown as cyan
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
assembly. Blue tick marks represent breakpoints, whereas red tick marks
represent a gap of larger than 10,000 bp. The number of breakpoints per
chromosome is indicated in black, and the chromosome numbers in red.

Regions that have a substantial amount of
sequence similarity, but do not match known
genes, were analyzed by that part of the Otto
system that uses the sequence similarity information
to predict a transcript. Here, Otto
evaluates evidence generated by the computational
pipeline, corresponding to conservation
between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins to predict potential genes in the human
genome. The sequence from the region
of genomic DNA contained in a gene bin was
extracted, and the subsequences supported by
any homology evidence were marked (plus 100
bases flanking these regions). The other bases
in the region, those not covered by any homology
evidence, were replaced by N's. This sequence
segment, with high confidence regions
represented by the consensus genomic sequence
and the remainder represented by N's,
was then evaluated by Genscan to see if a
consistent gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding algorithms),
and by eliminating regions with no
supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite
different from the prediction that Genscan returned
on the same region of native genomic
sequence. A weakness of using Genscan to
refine the gene model is the loss of valid, small
exons from the final annotation.
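
The masking step that prepares a gene bin for Genscan (keep bases supported by homology evidence plus 100 flanking bases, replace everything else with N) can be sketched as follows. This is an illustration only: the function name, the half-open coordinate convention, and the toy sequence are assumptions, and the real pipeline's data structures are not described in the text.

```python
from typing import List, Tuple


def mask_unsupported(sequence: str, evidence: List[Tuple[int, int]], flank: int = 100) -> str:
    """Replace every base not covered by evidence (plus `flank` bases on each
    side) with 'N'. Evidence intervals are half-open (start, end) offsets into
    `sequence`; that convention is an assumption of this sketch.
    """
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy example with a 2-base flank so the effect is easy to see.
masked = mask_unsupported("ACGTACGTAC", [(4, 6)], flank=2)
print(masked)
```

With the default flank of 100, the same call keeps each evidence region and 100 bases on either side, matching the description above.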

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that
the total number of "hits," as defined above,
divided by the number of exons in the prediction
must be >0.66 or must correspond to a
RefSeq sequence. A single-exon gene must be
covered by at least three supporting hits (±10
bases on each side), and these must cover the
complete predicted open reading frame. For
a single-exon gene, we also required that
the Genscan prediction include both a start
and a stop codon. Gene models that did not
meet these criteria were disregarded, and
those that passed were promoted to Otto
predictions. Homology-based Otto predictions
do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH
(63)] were run as part of the computational
analysis, the results of these programs were not
directly used in making the Otto predictions.
Otto predicted 11,226 additional genes by
means of sequence similarity.

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant
(Tukey HSD; P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an evidence-based
Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.
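
The retention rule for multi-exon homology-based predictions can be sketched in code, with several interpretive assumptions that the text does not settle. Here a "hit" is counted once per supported exon, the external edge of the first and last exon is not checked at all (the text says only that it was "allowed greater latitude"), and exons are listed in transcript order on the forward strand. All names are illustrative.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def exon_supported(exon: Interval, hits: List[Interval],
                   check_start: bool = True, check_end: bool = True,
                   tolerance: int = 10) -> bool:
    """True if some hit matches the checked exon edges to within `tolerance` bases."""
    for hit_start, hit_end in hits:
        start_ok = (not check_start) or abs(exon[0] - hit_start) <= tolerance
        end_ok = (not check_end) or abs(exon[1] - hit_end) <= tolerance
        if start_ok and end_ok:
            return True
    return False


def keep_multi_exon(exons: List[Interval], hits: List[Interval],
                    matches_refseq: bool = False) -> bool:
    """Retain the prediction if supported exons / total exons > 0.66,
    or if the prediction corresponds to a RefSeq sequence."""
    if matches_refseq:
        return True
    last = len(exons) - 1
    supported = sum(
        exon_supported(exon, hits, check_start=(i > 0), check_end=(i < last))
        for i, exon in enumerate(exons)
    )
    return supported / len(exons) > 0.66


exons = [(100, 200), (300, 400), (500, 600)]
print(keep_multi_exon(exons, [(95, 205), (300, 398)]))  # 2 of 3 exons supported
print(keep_multi_exon(exons, [(95, 205)]))              # 1 of 3 exons supported
```

The single-exon rule (at least three supporting hits covering the complete open reading frame, plus start and stop codons) would need a separate check and is omitted here.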

3.2 Otto validation 

To validate the Otto homology-based process
and the method that Otto uses to define the
structures of known genes, we compared transcripts
predicted by Otto with their corresponding
(and presumably correct) transcript from a
set of 4512 RefSeq transcripts for which there
was a unique Sim4 alignment (Table 7). In
order to evaluate the relative performance of
Otto and Genscan, we made three comparisons.
The first involved a determination of the accuracy
of gene models predicted by Otto with
only homology data other than the corresponding
RefSeq sequence (Otto-homology in Table
7). We measured the sensitivity (correctly predicted
bases divided by the total length of the
cDNA) and specificity (correctly predicted
bases divided by the sum of the correctly and
incorrectly predicted bases). Second, we examined
the sensitivity and specificity of the Otto
predictions that were made solely with the RefSeq
sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq).
And third, we determined the accuracy of the
Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment
method (Otto-RefSeq) was the most accurate,
and Otto-homology performed better than Genscan
by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the
Otto-RefSeq annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not contained
in the original RefSeq transcripts. The
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 
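
The two accuracy measures are simple ratios, and the 6.1% and 2.7% figures quoted above follow from the Otto (RefSeq only) row of Table 7. The sketch below uses illustrative names; the base counts in the usage example are made up so that they reproduce that row.

```python
from typing import Tuple


def sensitivity_specificity(n_aligned: int, refseq_len: int, prediction_len: int) -> Tuple[float, float]:
    """Sensitivity = N / RefSeq transcript length; specificity = N / prediction
    length, where N is the number of uniquely aligned RefSeq bases."""
    return n_aligned / refseq_len, n_aligned / prediction_len


# Hypothetical transcript: 2000 RefSeq bases, 1930 predicted bases, 1878 shared.
sens, spec = sensitivity_specificity(1878, 2000, 1930)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")

# Relating Table 7 to the percentages in the text.
missed_refseq_fraction = 1 - 0.939   # RefSeq bases absent from Otto-RefSeq
extra_predicted_fraction = 1 - 0.973 # predicted bases absent from RefSeq
print(f"{missed_refseq_fraction:.1%} missed, {extra_predicted_fraction:.1%} extra")
```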

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be combined
into a single transcript. We also examined
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different gene-prediction
strategy in regions where the homology
evidence was less strong. Here the
results of de novo gene predictions were
used. For these genes, we insisted that a
predicted transcript have at least two of the
following types of evidence to be included
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome
fragment matches. This final class of predicted
genes is a subset of the predictions
made by the three gene-finding programs
that were used in the computational pipeline.
For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs resulted
in about 155,695 predictions, of
which ~76,410 were nonredundant
(nonoverlapping with one another). Of these,
57,935 did not overlap known genes or
predictions made by Otto. Only 21,350 of
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement
for other supporting evidence is
made more stringent, this number drops rapidly
so that demanding two types of evidence
reduces the total gene number to 26,383 and
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by
all four categories of evidence is too stringent
because it would eliminate genes that encode
novel proteins (members of currently undescribed
protein families). No correction for
pseudogenes has been made at this point in 
the analysis. 
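
The gene totals quoted in this section can be reproduced by adding the Otto annotations to the de novo predictions at each evidence level. The counts below are taken from the text and from Table 8; the variable names are illustrative.

```python
# Otto annotations: known-gene matches plus homology-based predictions.
otto_known = 6538
otto_homology = 11226
otto_total = otto_known + otto_homology

# De novo predictions not overlapping Otto, by minimum number of evidence types.
# The counts for one and two types are from the text; three types is from Table 8.
de_novo_by_min_evidence = {1: 21350, 2: 8619, 3: 4947}

totals = {k: otto_total + n for k, n in de_novo_by_min_evidence.items()}
for k, total in totals.items():
    print(f"at least {k} evidence type(s): {total:,} genes")
```

The three totals correspond to the upper limit of 39,114, the 26,383-gene analysis set, and the figure of about 23,000.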

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the following
evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs, or similarity to a known protein)
reduced this number to 1010. Adding this to
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the
number of genes and presents the degree of
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes
were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homology
evidence, and 8619 from the subset of
transcripts from de novo gene-prediction programs
that have two types of supporting evidence.
The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are
subject to all the limitations of an automated
process. Considerable refinement is still necessary
to improve the accuracy of these transcript
predictions. All the predictions and
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest homology
evidence.

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas
those promoted from gene-prediction programs
average about 3.7 exons. The largest
number of exons that we have identified in a
transcript is 234 in the titin mRNA. Table 8
compares the amounts of evidence that support
the Otto and other predicted transcripts.
For example, one can see that a typical Otto 
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of 
the noncoding attributes of the assembled
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 




4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the
most visible, element of the structure of
the genome is the banding pattern produced
by Giemsa stain. Chromosomal banding
studies have revealed that about 17% to
20% of the human chromosome complement
consists of C-bands, or constitutive
heterochromatin (64). Much of this heterochromatin
is highly polymorphic and consists
of different families of alpha satellite
DNAs with various higher order repeat
structures (65). Many chromosomes have
complex inter- and intrachromosomal duplications
present in pericentromeric regions
(66). About 5% of the sequence reads
were identified as alpha satellite sequences;
these were not included in the assembly.
Examination of pericentromeric regions is
ongoing.

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.

Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                        Types of evidence                       No. of lines of evidence*
                              Total     Mouse     Rodent    Protein   Human     ≥1        ≥2        ≥3        ≥4
Otto
  Number of transcripts       17,969    17,065    14,881    15,477    16,374    17,968†   17,501    15,877    12,451
  Number of exons             141,218   111,174   89,569    108,431   118,869   140,710   127,955   99,574    59,804
De novo
  Number of transcripts       58,032    14,463    5,094     8,043     9,220     21,350    8,619     4,947     1,904
  Number of exons             319,935   48,594    19,344    26,264    40,104    79,148    31,130    17,508    6,520
No. of exons per transcript
  Otto                        7.84      5.77      6.01      6.99      7.24      7.81      7.19      6.00      4.28
  De novo                     5.53      3.17      3.80      3.27      4.36      3.7       3.56      3.42      3.16

*Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript. †This
number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-,
R-, and T-bands (67). These cytogenetic bands 
have been presumed to differ in their nucleotide 
composition and gene density, although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are
the most G+C- and gene-rich, and G-bands are
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy)
isochores fell into three G+C-rich classes representing
24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in
the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of
G+C content >48% (H3 isochores) averaged
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was
1078.6 kbp. The correlation between G+C
content and gene density was also examined in
50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that
the density of genes was greater in regions of
high G+C than in regions of low G+C content,
as expected. However, the correlation between
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 
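
The windowed G+C analysis can be sketched as follows: compute G+C content in contiguous windows, assign each window to an isochore class using the thresholds quoted above, and merge consecutive windows of the same class to measure the span of each region. This is an illustration, not the authors' code. The treatment of values exactly at 43% or 48% and the dropping of a trailing partial window are assumptions, and the toy sequence uses a 10-base window in place of 50 kbp.

```python
from itertools import groupby
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def isochore_class(gc: float) -> str:
    """Thresholds from the text: >48% is H3, 43 to 48% is H1+H2, <43% is L."""
    if gc > 48.0:
        return "H3"
    if gc >= 43.0:
        return "H1+H2"
    return "L"


def classify_windows(seq: str, window: int = 50_000) -> List[str]:
    """Classify contiguous, non-overlapping windows; a trailing partial window is dropped."""
    return [isochore_class(gc_percent(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, window)]


def run_lengths(classes: List[str], window: int = 50_000) -> List[Tuple[str, int]]:
    """Merge consecutive windows of the same class and report each run's span in bases."""
    return [(label, sum(1 for _ in run) * window) for label, run in groupby(classes)]


toy = "GGCCGGCCGG" + "ATATATATAT" * 2
classes = classify_windows(toy, window=10)
runs = run_lengths(classes, window=10)
print(classes, runs)
```

Averaging the run spans per class over a whole assembly would give figures comparable to the mean region lengths reported above.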

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3 -containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density, X, 4,
18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3
bands, did not have a particularly low gene
density in our analysis. In addition, chromosome
8, which we found to have a low gene
density, does not appear to be unusual in its
H3 banding. 

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It appears
that the human genome does indeed contain
deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22
have only about 12% of their collective 171
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not necessarily
imply that they are devoid of biological
function. 
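
The desert definition above (a region of more than 500 kbp without a gene) translates into a scan for long gaps between annotated genes. This is a minimal sketch with made-up coordinates. The function name and the decision to count the stretches before the first gene and after the last gene are assumptions, and sequence gaps or heterochromatin would need separate handling.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def find_deserts(genes: List[Interval], chrom_len: int, min_len: int = 500_000) -> List[Interval]:
    """Return gene-free stretches longer than `min_len` bases.

    `genes` holds (start, end) coordinates; overlapping genes are handled by
    tracking the furthest gene end seen so far.
    """
    deserts: List[Interval] = []
    cursor = 0
    for start, end in sorted(genes):
        if start - cursor > min_len:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_len - cursor > min_len:
        deserts.append((cursor, chrom_len))
    return deserts


# Hypothetical 2-Mbp chromosome with three genes.
genes = [(100_000, 150_000), (900_000, 950_000), (1_000_000, 1_200_000)]
deserts = find_deserts(genes, chrom_len=2_000_000)
desert_fraction = sum(end - start for start, end in deserts) / 2_000_000
print(deserts, desert_fraction)
```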

4.2 Linkage map 

Linkage maps provide the basis for genetic
analysis and are widely used in the study of the
inheritance of traits and in the positional cloning
of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between
homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities enabled
by a nearly complete genome sequence is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed
as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been previously
documented (73). From this mapping
result, there is a difference of 4.99 between
lowest rates and highest rates and the largest
difference of 4.4 between males and females
(4.99 to 0.47 on chromosome 16). This indicates
that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates between
males and females. The human genome
has recombination hotspots, where recombination
rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination
rate will depend on the size of the window
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphism Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor of variation in
recombination rates between any pair of
markers would be extremely useful in designing
markers to narrow a region of linkage,
such as in positional cloning projects.

Table 9. Characteristics of G+C in isochores.

                       Fraction of genome       Fraction of genes
Isochore   G+C (%)     Predicted*   Observed    Predicted*   Observed
H3         >48         5            9.5         37           24.8
H1/H2      43-48       25           21.2        32           26.6
L          <43         67           69.2        31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.

Fig. 9. Comparison of the number of exons per transcript between
the 17,968 Otto transcripts and 21,350 de novo transcript predictions
with at least one line of evidence that do not overlap with an
Otto prediction. Both sets have the highest number of transcripts
in the two-exon category, but the de novo gene predictions are
skewed much more toward smaller transcripts. In the Otto set,
19.7% of the transcripts have one or two exons, and 5.7%
have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20.
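
The windowed recombination rate in Section 4.2 (cM per Mbp over 3-Mbp windows) can be sketched once each linkage marker has both a physical and a genetic position. The approach below, linear interpolation of genetic position at the window boundaries, is one plausible way to do this and is not necessarily the method used for Table 12. The marker data and function names are made up.

```python
from bisect import bisect_right
from typing import List, Tuple

Marker = Tuple[float, float]  # (physical position in Mbp, genetic position in cM)


def interpolate_cm(markers: List[Marker], pos_mbp: float) -> float:
    """Linearly interpolate genetic position at a physical position.

    `markers` must be sorted by physical position; positions outside the
    marker range are clamped to the first or last marker.
    """
    xs = [m[0] for m in markers]
    i = bisect_right(xs, pos_mbp)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (x0, y0), (x1, y1) = markers[i - 1], markers[i]
    return y0 + (y1 - y0) * (pos_mbp - x0) / (x1 - x0)


def windowed_rates(markers: List[Marker], window_mbp: float = 3.0) -> List[Tuple[float, float]]:
    """Return (window start in Mbp, rate in cM per Mbp) for consecutive full windows."""
    rates = []
    start, end = markers[0][0], markers[-1][0]
    while start + window_mbp <= end:
        delta_cm = interpolate_cm(markers, start + window_mbp) - interpolate_cm(markers, start)
        rates.append((start, delta_cm / window_mbp))
        start += window_mbp
    return rates


toy_markers = [(0.0, 0.0), (3.0, 6.0), (6.0, 7.5), (9.0, 9.0)]
rates = windowed_rates(toy_markers)
print(rates)
```

Shrinking `window_mbp` makes the estimate depend more heavily on individual marker intervals, which mirrors the point above that the apparent variability depends on the window size.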

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG island
methylation is correlated with gene inactivation
(77), and has been shown to be
important during gene imprinting (78) and
tissue-specific gene expression (79).

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome
(74, 80) and an estimate of 499 CpG islands
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer 
(75) used a computational method to iden- 
tify CpG islands and defined them as re- 
gions of DNA of >200 bp that have a G+C 
content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide
>0.6.
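
The computational definition above can be written as a predicate on a candidate region. The text does not spell out how the observed/expected ratio is computed; the sketch assumes the commonly cited form, (number of CG dinucleotides x length) / (number of C x number of G). Function names and the toy sequences are illustrative.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (G+C fraction, observed/expected CG ratio) for a sequence.

    Expected CG count is taken as (#C * #G) / length, which is an assumption
    about the formula rather than something stated in the text.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")  # "CG" cannot overlap itself, so this count is exact
    expected = c * g / n
    ratio = observed / expected if expected > 0 else 0.0
    return (c + g) / n, ratio


def is_cpg_island(seq: str, min_len: int = 200, min_gc: float = 0.50,
                  min_ratio: float = 0.6) -> bool:
    """Apply the three criteria: length >200 bp, G+C >50%, obs/exp CG >0.6."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc > min_gc and ratio > min_ratio


print(is_cpg_island("CG" * 150))    # 300 bp, all G+C, CG-rich
print(is_cpg_island("AT" * 150))    # no G or C at all
print(is_cpg_island("ACGT" * 75))   # G+C is exactly 50%, so it fails the >50% test
```

Passing `min_ratio=0.8` corresponds to the stricter threshold that the text later calls method 2.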

It is difficult to make a direct comparison
of experimental definitions of CpG islands
with computational definitions because
computational methods do not consider
the methylation state of cytosine and
experimental methods do not directly select
regions of high G+C content. However, we
can determine the correlation of CpG islands
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly
available annotation of chromosome 22, as
well as using the entire human genome in
our assembly and the computationally annotated
genes. A variation of the CpG island
computation was compared with
Larsen et al. (76). The main differences are
that we use a sliding window of 200 bp,
consecutive windows are merged only if
they overlap, and we recompute the CpG
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
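
The three differences listed above (fixed 200-bp window, merging only of overlapping windows, rescoring after the merge) can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the step size of 1 bp, the helper names, and the form of the observed/expected ratio are assumptions, and the quadratic-time scan is only suitable for short test sequences.

```python
def _passes(seq, min_gc=0.50, min_ratio=0.6):
    """Score one region: G+C fraction and observed/expected CG ratio (assumed form)."""
    n = len(seq)
    c, g, cg = seq.count("C"), seq.count("G"), seq.count("CG")
    if n == 0 or c == 0 or g == 0:
        return False
    return (c + g) / n > min_gc and cg / (c * g / n) > min_ratio


def find_islands(seq, window=200, step=1, min_gc=0.50, min_ratio=0.6):
    """Return a list of (start, end) half-open intervals for candidate CpG islands."""
    s = seq.upper()
    # 1. Collect every window that passes the thresholds on its own.
    passing = [
        (i, i + window)
        for i in range(0, len(s) - window + 1, step)
        if _passes(s[i:i + window], min_gc, min_ratio)
    ]
    # 2. Merge consecutive passing windows only when they overlap.
    merged = []
    for start, end in passing:
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # 3. Recompute the score on each merged region and reject failures.
    return [(a, b) for a, b in merged if _passes(s[a:b], min_gc, min_ratio)]
```

Raising `min_ratio` to 0.8 corresponds to the stricter setting called method 2 in the following paragraph.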

To compute various CpG statistics, we used two different thresholds of
CG dinucleotide likelihood ratio. Besides using the original threshold
of 0.6 (method 1), we used a higher threshold of CG dinucleotide
likelihood ratio of 0.8 (method 2), which results in the number of CpG
islands on chromosome 22 close to the number of annotated genes on this
chromosome. The main results are summarized in Table 13. CpG islands
computed with method 1 predicted only 2.6% of the CSA sequence as CpG,
but 40% of the gene starts (start codons) are contained inside a CpG
island. This is comparable to ratios reported by others (82). The last
two rows of the table show the observed and expected average distance,
respectively, of the closest CpG island from the first exon. The
observed average distances to the closest CpG island are smaller than
the corresponding expected distances, confirming an association between
CpG islands and the first exon.

We also looked at the distribution of CpG island nucleotides among
various sequence classes such as intergenic regions, introns, exons,
and first exons. We computed the likelihood score for each sequence
class as the ratio of the observed fraction of CpG island nucleotides
in that sequence class and the expected fraction of CpG island
nucleotides in that sequence class. Applying method 1 to the CSA gave
scores of 0.89 for intergenic region, 1.2 for intron, 5.86 for exon,
and 13.2 for first exon. The same trend was also found for chromosome
22 and after the application of a higher threshold (method 2) on both
data sets. In sum, genome-wide analysis has extended earlier analysis
and suggests a strong correlation between CpG islands and first coding
exons.
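
The likelihood score described above can be sketched as a simple ratio. The text does not define the "expected fraction"; the sketch assumes it is the share of all nucleotides that belong to the class, so that a score of 1 means islands fall in the class no more often than chance. The function name and the toy counts are hypothetical, and the classes are treated as disjoint for simplicity.

```python
def class_likelihood_scores(island_nt_by_class, total_nt_by_class):
    """Observed / expected fraction of CpG-island nucleotides per sequence class.

    island_nt_by_class: CpG-island nucleotides falling in each class
    total_nt_by_class:  all nucleotides belonging to each class
    Assumption: expected fraction = class size / total sequence size.
    """
    total_island = sum(island_nt_by_class.values())
    total_nt = sum(total_nt_by_class.values())
    scores = {}
    for cls, class_nt in total_nt_by_class.items():
        observed = island_nt_by_class.get(cls, 0) / total_island
        expected = class_nt / total_nt
        scores[cls] = observed / expected
    return scores


# Toy counts, not data from the article.
toy_scores = class_likelihood_scores(
    {"intergenic": 50, "exon": 50},
    {"intergenic": 900, "exon": 100},
)
```

With these toy counts the exon class is enriched (score above 1) and the intergenic class is depleted (score below 1), the same direction as the trend reported in the text.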
4.4 Genome-wide repetitive elements

The proportion of the genome covered by various classes of repetitive
DNA is presented in Table 14. We observed about 35% of the genome in
these repeat classes, very similar to values reported previously (83).
Repetitive sequence may be underrepresented in the Celera assembly as a
result of incomplete repeat resolution, as discussed above. About 8% of
the scaffold length is in gaps, and we expect that much of this is
repetitive sequence. Chromosome 19 has the highest repeat density
(57%), as well as the highest gene density (Table 10). Of interest,
among the different classes of repeat elements, we observe a clear
association of Alu elements and gene density, which was not observed
between LINEs and gene density.

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless
paralogs and processed pseudogenes detected in
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 
Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows.
The percent of G+C nucleotides was calculated in 100-kbp windows. The
number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA transcripts into the genome
results in functional genes, called intronless paralogs, or inactivated
genes (pseudogenes). A paralog refers to a gene that appears in more
than one copy in a given organism as a result of a duplication event.
The existence of both intron-containing and intronless forms of genes
encoding functionally similar or identical proteins has been previously
described (84, 85). Cataloging these evolutionary events on the genomic
landscape is of value in understanding the functional consequences of
such gene-duplication events in cellular biology. Identification of
conserved intronless paralogs in the mouse or other mammalian genomes
should provide the basis for capturing the evolutionary chronology of
these transposition events and provide insights into gene loss and
accretion in the mammalian radiation. A set of proteins corresponding
to all 901 Otto-predicted, single-exon genes were subjected to BLAST
analysis against the proteins encoded by the remaining multiexon
predicted transcripts. Using homology criteria of 70% sequence identity
over 90% of the length, we identified 298 instances of single- to
multi-exon correspondence. Of these 298 sequences, 97 were represented
in the GenBank data set of experimentally validated full-length genes
at the stringency specified and were verified by manual inspection.

We believe that these 97 cases may represent intronless paralogs (see
Web table 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known genes.
Most of these are flanked by direct repeat sequences, although the
precise nature of these repeats remains to be determined. All of the
cases for which we have high confidence contain polyadenylated
[poly(A)] tails characteristic of retrotransposition.
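
The homology criteria used to pair single-exon with multiexon proteins (70% sequence identity over 90% of the length) can be expressed as a small filter over alignment hits. The sketch below is hypothetical: the `Hit` record, its field names, and the choice to measure coverage against the single-exon query length are assumptions, since the text does not say which sequence the 90% refers to or how BLAST output was parsed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str            # single-exon protein (hypothetical identifier)
    subject_id: str          # multiexon protein (hypothetical identifier)
    percent_identity: float  # 0 to 100
    aligned_length: int      # aligned residues on the query
    query_length: int        # full length of the query protein


def passes_homology(hit, min_identity=70.0, min_coverage=0.90):
    """True when the hit meets both the identity and the coverage threshold."""
    coverage = hit.aligned_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def single_to_multi_matches(hits):
    """Keep, for each single-exon query, the passing hit with highest identity."""
    best = {}
    for hit in hits:
        if not passes_homology(hit):
            continue
        current = best.get(hit.query_id)
        if current is None or hit.percent_identity > current.percent_identity:
            best[hit.query_id] = hit
    return best
```

Each key of the returned dictionary would correspond to one instance of single- to multi-exon correspondence in the sense used above.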

Recent publications describing the phenomenon of functional intronless
paralogs speculate that retrotransposition may serve as a mechanism
used to escape X-chromosomal inactivation (84, 86). We do not find a
bias toward X chromosome origination of these retrotransposed genes;
rather, the results show a random chromosome distribution of both the
intron-containing and corresponding intronless paralogs. We also have
found several cases of retrotransposition from a single source
chromosome to multiple target chromosomes. Interesting examples include
the retrotransposition of a five-exon-containing ribosomal protein L21
gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14. The size
of the source genes can also show variability. The largest example is
the 31-exon diacylglycerol kinase zeta gene on chromosome 11 that has
an intronless paralog on chromosome 13. Regardless of route,
retrotransposition with subsequent gene changes in coding or noncoding
regions that lead to different functions or expression patterns
represents a key route to providing an enhanced functional repertoire
in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a
clear overrepresentation of genes involved in translational processes
(40% ribosomal proteins and 10% translation elongation factors) and
nuclear regulation (HMG nonhistone proteins, 4%), as well as metabolic
and regulatory enzymes.
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could ac- 
count for differences in tissue-specific gene 
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 
5.2 Pseudogenes

A pseudogene is a nonfunctional copy that is very similar to a normal
gene but that has been altered slightly so that it is not expressed. We
developed a method for the preliminary analysis of processed
pseudogenes in the human genome as a starting point in elucidating the
ongoing evolutionary forces that account for gene inactivation. The
general structural characteristics of these processed pseudogenes
include the complete lack of intervening sequences found in the
functional counterparts, a poly(A) tract at the 3' end, and direct
repeats flanking the pseudogene sequence. Processed pseudogenes occur
as a result of retrotransposition, whereas unprocessed pseudogenes
arise from segmental genome duplication.

Table 11. Genome overview.

Size of the genome (including gaps)                                  2.91 Gbp
Size of the genome (excluding gaps)                                  2.66 Gbp
Longest contig                                                       1.99 Mbp
Longest scaffold                                                     14.4 Mbp
Percent of A+T in the genome                                         54
Percent of G+C in the genome                                         38
Percent of undetermined bases in the genome                          9
Most GC-rich 50 kb                                                   Chr. 2 (66%)
Least GC-rich 50 kb                                                  Chr. X (25%)
Percent of genome classified as repeats                              35
Number of annotated genes                                            26,383
Percent of annotated genes with unknown function                     42
Number of genes (hypothetical and annotated)                         39,114
Percent of hypothetical and annotated genes with unknown function    59
Gene with the most exons                                             Titin (234 exons)
Average gene size                                                    27 kbp
Most gene-rich chromosome                                            Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                          Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)         605 Mbp
Percent of base pairs spanned by genes                               25.5 to 37.8*
Percent of base pairs spanned by exons                               1.1 to 1.4*
Percent of base pairs spanned by introns                             24.4 to 36.4*
Percent of base pairs in intergenic DNA                              74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons         Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons          Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)   Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

            Male                   Sex-average            Female
Chrom.      Max.   Avg.   Min.     Max.   Avg.   Min.     Max.   Avg.   Min.
1           2.60   1.12   0.23     2.81   1.42   0.52     3.39   1.76   0.68
2           2.23   0.78   0.33     2.65   1.12   0.54     3.17   1.40   0.61
3           2.55   0.86   0.23     2.40   1.07   0.42     2.71   1.30   0.33
4           1.66   0.67   0.15     2.06   1.04   0.60     2.50   1.40   0.77
5           2.00   0.67   0.18     1.87   1.08   0.42     2.26   1.43   0.62
6           1.97   0.71   0.28     2.57   1.12   0.37     3.47   1.67   0.64
7           2.34   1.16   0.48     1.67   1.17   0.47     2.27   1.21   0.34
8           1.83   0.73   0.14     2.40   1.05   0.46     3.44   1.36   0.43
9           2.01   0.99   0.53     1.95   1.32   0.77     2.63   1.66   0.82
10          3.73   1.03   0.22     3.05   1.29   0.66     2.84   1.51   0.76
11          1.43   0.72   0.31     2.13   0.99   0.47     3.10   1.32   0.49
12          4.12   0.76   0.26     3.35   1.16   0.49     2.93   1.55   0.59
13          1.60   0.75   0.01     1.87   0.95   0.17     2.49   1.19   0.32
14          3.15   0.98   0.18     2.65   1.30   0.62     3.14   1.63   0.75
15          2.28   0.94   0.34     2.31   1.22   0.42     2.53   1.56   0.54
16          1.83   1.00   0.47     2.70   1.55   0.63     4.99   2.32   1.12
17          3.87   0.87   0.00     3.54   1.35   0.54     4.19   1.83   0.94
18          3.12   1.37   0.86     3.75   1.66   0.43     4.35   2.24   0.72
19          3.02   0.97   0.10     2.57   1.41   0.49     2.89   1.75   0.87
20          3.64   0.89   0.00     2.79   1.50   0.83     3.31   2.15   1.34
21          3.23   1.26   0.69     2.37   1.62   1.08     2.58   1.90   1.18
22          1.25   1.10   0.84     1.88   1.41   1.08     3.73   2.08   0.93
X           NA     NA     NA       NA     NA     NA       3.12   1.64   0.72
Y           NA     NA     NA       NA     NA     NA       NA     NA     NA

Genome      4.12   0.88   0.00     3.75   1.22   0.17     4.99   1.55   0.32

We searched the complete set of Otto-predicted transcripts against the
genomic sequence by means of BLAST. Genomic regions corresponding to
all Otto-predicted transcripts were excluded from this analysis. We
identified 2909 regions matching with greater than 70% identity over at
least 70% of the length of the transcripts that likely represent
processed pseudogenes. This number is probably an underestimate because
specific methods to search for pseudogenes were not used.
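
The selection step described above (keep strong transcript-to-genome matches, but discard any that fall on a region already assigned to a predicted transcript) can be sketched as an interval filter. The tuple layout, the function names, and the use of half-open coordinates are assumptions for illustration; the text does not describe how matches or excluded regions were represented.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def candidate_pseudogene_regions(hits, annotated,
                                 min_identity=70.0, min_coverage=0.70):
    """Filter genomic matches down to candidate processed pseudogenes.

    hits:      iterable of (chrom, start, end, percent_identity, coverage),
               where coverage is the fraction of the transcript length aligned
    annotated: iterable of (chrom, start, end) regions of predicted transcripts
    """
    kept = []
    for chrom, start, end, identity, coverage in hits:
        # "greater than 70% identity over at least 70% of the length"
        if identity <= min_identity or coverage < min_coverage:
            continue
        # Exclude matches that land on an already annotated transcript.
        if any(c == chrom and overlaps(start, end, s, e)
               for c, s, e in annotated):
            continue
        kept.append((chrom, start, end))
    return kept
```

A linear scan over `annotated` is used for clarity; for genome-scale input an interval index would be the more usual choice.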

We looked for correlations between structural elements and the
propensity for retrotransposition in the human genome. GC content and
transcript length were compared between the genes with processed
pseudogenes (1177 source genes) versus the remainder of the predicted
gene set. Transcripts that give rise to processed pseudogenes have
shorter average transcript length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no pseudogene was detected. The
overall GC content did not show any significant difference, contrary to
a recent report (88). There is a clear trend in gene families that are
present as processed pseudogenes. These include ribosomal proteins
(67%), lamin receptors (10%), translation elongation factor alpha (5%),
and HMG-non-histone proteins (2%). The increased occurrence of
retrotransposition (both intronless paralogs and processed pseudogenes)
among genes involved in translation and nuclear regulation may reflect
an increased transcriptional activity of these genes.



5.3 Gene duplication in the human 
genome 

Building on a previously published procedure (27), we developed a
graph-theoretic algorithm, called Lek, for grouping the predicted human
protein set into protein families (89).



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG
likelihood ratio of >0.6. Method 2 uses a CG likelihood ratio of >0.8.

                                                      Chromosome 22          Whole genome (CS assembly)
                                                      Method 1   Method 2    Method 1   Method 2
Number of CpG islands detected                        5,211      522         195,706    26,876
Average length of island (bp)                         390        535         395        497
Percent of sequence predicted as CpG                  5.9        0.8         2.6        0.4
Percent of first exons that overlap a CpG island      44         25          42         22
Percent of first exons with first position of exon
  contained inside a CpG island                       37         22          40         21
Average distance between first exon and closest
  CpG island (bp)                                     1,013      10,486      2,182      17,021
Expected distance between first exon and closest
  CpG island (bp)                                     3,262      32,567      7,164      55,811



Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

                                              Megabases in   Percent of   Previously
                                              assembled      assembly     predicted
Repetitive elements                           sequences                   (%) (83)
Alu                                           288            9.9          10.0
Mammalian interspersed repeat (MIR)           66             2.3          1.7
Medium reiteration (MER)                      50             1.7          1.6
Long terminal repeat (LTR)                    155            5.3          5.6
Long interspersed nucleotide element (LINE)   466            16.1         16.7
Total                                         1025           35.3         35.6



The complete clusters that result from the Lek clustering provide one
basis for comparing the role of whole-genome or chromosomal duplication
in protein family expansion as opposed to other means, such as tandem
duplication. Because each complete cluster represents a closed and
certain island of homology, and because Lek is capable of
simultaneously clustering protein complements of several organisms, the
number of proteins contributed by each organism to a complete cluster
can be predicted with confidence depending on the quality of the
annotation of each genome. The variance of each organism's contribution
to each cluster can then be calculated, allowing an assessment of the
relative importance of large-scale duplication versus smaller-scale,
organism-specific expansion and contraction of protein families,
presumably as a result of natural selection operating on individual
protein families within an organism. As can be seen in Fig. 12, the
large variance in the relative numbers of human as compared with D.
melanogaster and Caenorhabditis elegans proteins in complete clusters
may be explained by multiple events of relative expansions in gene
families in each of the three animal genomes. Such expansions would
give rise to the distribution that shows a peak at 1:1 in the ratio for
human-worm or human-fly clusters with the slope spread covering both
human and fly/worm predominance, as we observed (Fig. 12). Furthermore,
there are nearly as many clusters where worm and fly proteins
predominate despite the larger numbers of proteins in the human. At
face value, this analysis suggests that natural selection acting on
individual protein families has been a major force driving the
expansion of at least some elements of the human protein set. However,
in our analysis, the difference between an ancient whole-genome
duplication followed by loss, versus piecemeal duplication, cannot be
easily distinguished. In order to differentiate these scenarios, more
extended analyses were performed.

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein
family-based method that identified highly
conserved blocks of duplication. We then
describe our comprehensive method for
identifying all interchromosomal block duplications.
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the idea of searching for blocks
of highly conserved homologous proteins that occur in more than one
location on the genome. For this comparison, two genes were considered
equivalent if their protein products were determined to be in the same
family and the same complete Lek cluster (essentially paralogous genes)
(89). Initially, each chromosome was represented as a string of genes
ordered by the start codons for predicted genes along the chromosome.
We considered the two strands as a single string, because local
inversions are relatively common events relative to large-scale
duplications. Each gene was indexed according to the protein family and
Lek complete cluster (89). All pairs of indexed gene strings were then
aligned in both the forward and reverse directions with the
Smith-Waterman algorithm (90). A match between two proteins of the same
Lek complete cluster was given a score of 10 and a mismatch -10, with
gap open and extend penalties of -4 and -1. With these parameters, 19
conserved interchromosomal blocks of duplication were observed, all of
which were also detected and expanded by the comprehensive method
described below. The detection of only a relatively small number of
block duplications was a consequence of using an intrinsically
conservative method grounded in the conservative constraints of the
complete Lek clusters.
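
The gene-string alignment described above can be sketched as a local (Smith-Waterman) alignment over cluster identifiers instead of residues, using the quoted scores. The sketch below is illustrative only: it returns the best score rather than the aligned block, it represents each gene by an integer cluster ID (a hypothetical encoding), and it assumes that the first position of a gap costs 4 and each further position costs 1, a convention the text does not specify.

```python
def local_alignment_score(a, b, match=10, mismatch=-10,
                          gap_open=-4, gap_extend=-1):
    """Best Smith-Waterman score between two lists of cluster IDs (affine gaps)."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best local score ending at (i, j); E/F: best score ending in a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + pair, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(a, b):
    """Align in both the forward and the reverse direction, keep the better score."""
    return max(local_alignment_score(a, b), local_alignment_score(a, b[::-1]))
```

Under these assumptions, four collinear matches score 40, and skipping one intervening unrelated gene lowers the total by the gap-open penalty only.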

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins (9,675,713 amino acids) were
concatenated end-to-end in order as they occur along each of the 24
chromosomes, irrespective of strand location. The concatenated protein
set was then aligned against each chromosome by the MUMmer algorithm.
The resulting matches were clustered to extract all sets of three or
more protein matches that occur in close proximity on two different
chromosomes (93); these represent the candidate segmental duplications.
A series of filters were developed and applied to remove likely
false-positives from this set; for example, small blocks that were
spread across many proteins were removed. To refine the filtering
methods, a shuffled protein set was first created by taking the 26,588
proteins, randomizing their order, and then partitioning them into 24
shuffled chromosomes, each containing the same number of proteins as
the true genome. This shuffled protein set has the identical
composition to the real genome; in particular, every protein and every
domain appears the same number of times. The complete algorithm was
then applied to both the real and the shuffled data, with the results
on the shuffled data being used to estimate the false-positive rate.
The algorithm after filtering yielded 10,310 gene pairs in 1077
duplicated blocks containing 3522 distinct genes; tandemly duplicated
expansions in many of the blocks explain the excess of gene pairs to
distinct genes. In the shuffled data, by contrast, only 370 gene pairs
were found, giving a false-positive estimate of 3.6%. The most likely
explanation for the 1077 block duplications is ancient segmental
duplications. In many cases, the order of the proteins has been
shuffled, although proximity is preserved. Out of the 1077 blocks, 159
contain only three genes, 137 contain four genes, and 781 contain five
or more genes.
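
The shuffled control described above can be sketched in a few lines: pool all proteins, randomize their order, and deal them back into chromosomes of the original sizes, then compare pair counts between real and shuffled runs. The function names and the list-of-lists representation are assumptions for illustration; the duplication-detection step itself is not reproduced here.

```python
import random


def shuffle_proteins(chromosomes, seed=None):
    """Return shuffled chromosomes with the same per-chromosome protein counts.

    chromosomes: list of lists of protein identifiers, one list per chromosome.
    Every protein appears exactly as often as in the input; only order and
    chromosome assignment are randomized.
    """
    rng = random.Random(seed)
    pool = [protein for chrom in chromosomes for protein in chrom]
    rng.shuffle(pool)
    shuffled, position = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[position:position + len(chrom)])
        position += len(chrom)
    return shuffled


def false_positive_estimate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real


# Figures quoted in the text: 370 pairs in shuffled data, 10,310 in real data.
estimate = false_positive_estimate(370, 10_310)
```

With the counts quoted in the text the ratio is about 0.036, which matches the reported false-positive estimate of 3.6%.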

To illustrate the extent of the detected duplications, Fig. 13 shows
all 1077 block duplications indexed to each chromosome in 24 panels in
which only duplications mapped to the indexed chromosome are displayed.
The figure makes it clear that the duplications are ubiquitous in the
genome. One feature that it displays is many relatively small
chromosomal stretches, with one-to-many duplication relationships that
are graphically striking. One such example captured by the analysis is
the well-documented olfactory receptor (OR) family, which is scattered
in blocks throughout the genome and which has been analyzed for
genome-deployment reconstructions at several evolutionary stages (94).
The figure also illustrates that some chromosomes, such as chromosome
2, contain many more detected large-scale duplications than others.
Indeed, one of the largest duplicated segments is a large block of 33
proteins on chromosome 2, spread among eight smaller blocks in 2p, that
aligns to a paralogous set on chromosome 14, with one rearrangement
(see chromosomes 2 and 14 panels in Fig. 13).

The proteins are not contiguous but span a region containing 97
proteins on chromosome 2 and 332 proteins on chromosome 14. The
likelihood of observing this many duplicated proteins by chance, even
over a span of this length, is 2.3 × 10^-68 (93). This duplicated set
spans 20 Mbp on chromosome 2 and 63 Mbp on chromosome 14, over 70% of
the latter chromosome. Chromosome 2 also contains a block duplication
that is nearly as large, which is shared by chromosome arm 2q and
chromosome 12. This duplication incorporates two of the four known Hox
gene clusters, but considerably expands the extent of the duplications
proximally and distally on the pair of chromosome arms. This breadth of
duplication is also seen on the two chromosomes carrying the other two
Hox clusters.

An additional large duplication, between chromosomes 18 and 20, serves
as a good example to illustrate some of the features common to many of
the other observed large duplications (Fig. 13, inset). This
duplication contains 64 detected ordered intrachromosomal pairs of
homologous genes. After discounting a 40-Mb stretch of chromosome 18
free of matches to chromosome 20, which is likely to represent a large
insert (between the gene assignments "Krup rel" and "collagen rel" on
chromosome 18 in Fig. 13), the full duplication segment covers 36 Mb on
chromosome 18 and 28 Mb on chromosome 20. By this measure, the
duplication segment spans nearly half of each chromosome's net length.
The most likely scenario is that the whole span of this region was
duplicated as a single very large block, followed by shuffling owing to
smaller scale rearrangements. As such, at least four subsequent
rearrangements would need to be invoked to explain the relative
insertions and inversions seen in the duplicated segment interval. The
64 protein pairs in this alignment occur among 217 protein assignments
on chromosome 18, and among 322 protein assignments on chromosome 20,
for a density of involved proteins of 20 to 30%. This is consistent
with an ancient large-scale duplication followed by subsequent gene
loss on one or both chromosomes. Loss of just one member of a gene pair
subsequent to the duplication would result in a failure to score a gene
pair in the block; less than 50% gene loss on the chromosomes would
lead to the duplication density observed here. As an independent
verification of the significance of the alignments detected, it can be
seen that a substantial number of the pairs of aligning proteins in
this duplication, including some of those annotated (Fig. 13), are
those populating small Lek complete clusters (see above). This
indicates that they are members of very small families of paralogs;
their relative scarcity within the genome validates the uniqueness and
robust nature of their alignments.

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted.

Two additional qualitative features were observed among many of the
large-scale duplications. First, several proteins with disease
associations, with OMIM (Online Mendelian Inheritance in Man)
assignments, are members of duplicated segments (see web table 2 on
Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also
observed a few instances where paralogs on both duplicated segments are
associated with similar disease conditions. Notable among these genes
are proteins involved in hemostasis (coagulation factors) that are
associated with bleeding disorders, transcriptional regulators like the
homeobox proteins associated with developmental disorders, and
potassium channels associated with cardiovascular conduction
abnormalities. For each of these disease genes, closer study of the
paralogous genes in the duplicated segment may reveal new insights into
disease causation, with further investigation needed to determine
whether they might be involved in the same or similar genetic diseases.
Second, although there is a conserved number of proteins and coding
exons predicted for specific large duplicated spans within the
chromosome 18 to 20 alignment, the genomic DNA of chromosome 18 in
these specific spans is in some cases more than 10-fold longer than the
corresponding chromosome 20 DNA. This selective accretion of noncoding
DNA (or conversely, loss of noncoding DNA) on one of a pair of
duplicated chromosome regions was observed in many compared regions.
Hypotheses to explain which mechanisms foster these processes must be
tested.

Evaluation of the alignment results gives some perspective on dating of
the duplications. As noted above, large-scale ancient segmental
duplication in fact best explains many of the blocks detected by this
genome-wide analysis. The regions of human chromosomes involved in the
large-scale duplications expanded upon above (chromosomes 2 to 14, 2 to
12, and 18 to 20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse chromosomal regions are much more
similar in sequence conservation, and even in order, to their human
synteny partners than the human duplication regions are to each other.
Further, the corresponding mouse chromosomal regions each bear a
significant proportion of genes orthologous to the human genes on which
the human duplication assignments were made. On the basis of these
factors, the corresponding mouse chromosomal spans, at coarse
resolution, appear to be products of the same large-scale duplications
observed in humans. Although further detailed analysis must be carried
out once a more complete genome is assembled for mouse, the underlying
large duplications appear to predate the two species' divergence. This
dates the duplications, at the latest, before divergence of the primate
and rodent lineages. This date can be further refined upon examination
of the synteny between human chromosomes and those of chicken,
pufferfish (Fugu rubripes), or zebrafish (95). The only substantial
syntenic stretches mapped in these species corresponding to both pairs
of human duplications are restricted to the Hox cluster regions. When
the synteny of these regions (or others) to human chromosomes is
extended with further mapping, the ages of the nearly chromosome-length
duplications seen in humans are likely to be dated to the root of
vertebrate divergence.

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and multiple smaller events. Further
analysis, focused especially on comparing the estimated ages of all the
block duplications, derived partially from interspecies genome
comparisons, will be necessary to determine which of these two
hypotheses is more likely. Comparisons of genomes of different
vertebrates, and even cross-phyla genome comparisons, will allow for
the deconvolution of duplications to eventually reveal the stagewise
history of our genome, and with it a history of the emergence of many
of the key functions that distinguish us from other living things.

6 A Genome-Wide Examination of
Sequence Variations

Summary. Computational methods were used to identify single-nucleotide
polymorphisms (SNPs) by comparison of the Celera sequence to other SNP
resources. The SNP rate between two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly throughout the genome. Only a
very small proportion of all SNPs (<1%) potentially impact protein
function based on the functional analysis of SNPs that affect the
predicted coding regions. This results in an estimate that only
thousands, not millions, of genetic variations may contribute to the
structural diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a
dramatic acceleration in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we discover the genetic basis
for variation in health among human beings. Whole-genome shotgun
sequencing is a particularly effective method for detecting sequence
variation in tandem with whole-genome assembly. In addition, we
compared the distribution and attributes of SNPs ascertained by three
other methods: (i) alignment of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads of genomic sequence
(referred to as "Kwok"; 1,120,195 SNPs) (97), and (iii) reduced
representation shotgun sequencing (referred to as "TSC"; 632,640 SNPs)
(98). These data were consistent in showing an overall nucleotide
diversity of ~8 × 10^-4, marked heterogeneity across the genome in SNP
density, and an overwhelming preponderance of noncoding variation that
produces no change in expressed proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full use of sequence depth and
quality at every site, and quantitatively control the rate of
false-positive and false-negative calls with an explicit sampling model
(99). Comparison of consensus sequences in the absence of these details
necessitated a more ad hoc approach (quality scores could not readily
be obtained for the PFP assembly). First, all sequence differences
between the two consensus sequences were identified; these were then
filtered to reduce the contribution of sequencing errors and
misassembly. As a measure of the effectiveness of the filtering step,
we monitored the ratio of transition and transversion substitutions,
because a 2:1 ratio has been well documented as typical in mammalian
evolution (100) and in human SNPs (101, 102). The filtering steps
consisted of removing variants where the quality score in the Celera
consensus was less than 30 and where the density of variants was
greater than 5 in 400 bp. These filters resulted in shifting the
transition-to-transversion ratio from 1.57:1 to 1.89:1. When applied to
2.3 Gbp of alignments between the Celera and PFP consensus sequences,
these filters resulted in identification of 2,104,820 putative SNPs
from a total of 2,778,474 substitution differences. Overlaps between
this set of SNPs and those found by other methods are described below.
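
The two filters and the transition-to-transversion check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tuple layout, the function names, and the choice to count variant density in a 400-bp window centred on each variant (using all candidate variants, before quality filtering) are hypothetical, since the text does not specify how the window was placed.

```python
from bisect import bisect_left

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(ref, alt):
    """Return True if the substitution ref->alt is a transition."""
    return frozenset((ref, alt)) in TRANSITIONS


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Drop low-quality variants and variants in dense clusters.

    variants: list of (position, ref, alt, quality), sorted by position.
    The window placement (centred on each variant) is an assumption.
    """
    positions = [v[0] for v in variants]
    half = window // 2
    kept = []
    for pos, ref, alt, qual in variants:
        if qual < min_quality:
            continue
        lo = bisect_left(positions, pos - half)
        hi = bisect_left(positions, pos + half)
        if hi - lo > max_in_window:
            continue
        kept.append((pos, ref, alt, qual))
    return kept


def ts_tv_ratio(variants):
    """Ratio of transitions to transversions in a list of variants."""
    ts = sum(1 for _, ref, alt, _ in variants if is_transition(ref, alt))
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")
```

Comparing `ts_tv_ratio` before and after `filter_variants` mirrors the way the ratio was used in the text as a monitor of filtering effectiveness.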

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from dbSNP
(www.ncbi.nlm.nih.gov/SNP) and 13,150 from HGMD (Human Gene Mutation
Database, from the University of Wales, UK), were mapped on the Celera
consensus sequence by a sequence similarity search with the program
PowerBlast (103). The two largest data sets in dbSNP are the Kwok and
TSC sets, with 47% and 25% of the dbSNP records. Low-quality alignments
with partial coverage of the dbSNP sequence and alignments that had
less than 98% sequence identity between the Celera sequence and the
dbSNP flanking sequence were eliminated. dbSNP sequences mapping to
multiple locations on the Celera genome were discarded. A total of
2,336,935 dbSNP variants were mapped to 1,223,038 unique locations on
the Celera sequence, implying considerable redundancy in dbSNP. SNPs in
the TSC set mapped to 585,811 unique genomic locations, and SNPs in the
Kwok set mapped to 438,032 unique locations. The combined count of
unique SNPs used in this analysis, including Celera-PFP, TSC, and Kwok,
is 2,737,668. Table 15 shows that a substantial fraction of SNPs
identified by one of these methods was also found by another method.
The very high overlap (36.2%) between the Kwok and Celera-PFP SNPs may
be due in part to the use by Kwok of sequences that went into the PFP
assembly. The unusually low overlap (16.4%) between the Kwok and TSC
sets is due to their being the smallest two sets. In addition, 24.5% of
the Celera-PFP SNPs overlap with SNPs derived from the Celera genome
sequences (46). SNP validation in population samples is an expensive
and laborious process, so confirmation on multiple data sets may
provide an efficient initial validation "in silico" (by computational
analysis).

Table 15. Overlap of SNPs from genome-wide SNP databases. Table entries
are SNP counts for each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of overlapping SNPs
divided by the number of SNPs in the smaller of the two databases
compared. Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032. Only unique SNPs in the TSC
and Kwok data sets were included.
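
The fraction of overlap defined for Table 15 (shared SNPs divided by the size of the smaller data set) can be checked with a short calculation. The sketch below uses the data-set totals quoted in the text and the pairwise counts reported in Table 15; the function and variable names are illustrative only.

```python
def overlap_fraction(shared, size_a, size_b):
    """Shared SNP count divided by the size of the smaller data set."""
    return shared / min(size_a, size_b)


# Totals of unique SNPs per data set, as quoted in the text.
totals = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Pairwise overlap counts as reported in Table 15.
shared_counts = {
    ("Celera-PFP", "TSC"): 188_694,
    ("Celera-PFP", "Kwok"): 158_532,
    ("TSC", "Kwok"): 72_024,
}

for (a, b), shared in shared_counts.items():
    fraction = overlap_fraction(shared, totals[a], totals[b])
    print(f"{a} vs {b}: {fraction:.3f}")
```

Because the denominator is always the smaller set, the Kwok total (438,032) is the denominator for both pairs that involve Kwok.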

One means of assessing whether the three sets of SNPs provide the same
picture of human variation is to tally the frequencies of the six
possible base changes in each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly derived from small-scale analysis
on candidate genes (101), and our analysis with all three data sets
validates the previous observations at the whole-genome scale. There is
remarkable homogeneity between the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46) in this substitution pattern.
Compared with the rest of the data sets, Celera-PFP deviates slightly
from the 2:1 transition-to-transversion ratio observed in the other SNP
sets. This result is not unexpected, because some fraction of the
computationally identified SNPs in the Celera-PFP comparison may in
fact be sequence errors. A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one assumed that 15% of the
sequence differences in the Celera-PFP set were a result of
(presumably random) sequence errors.
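
The 15% figure can be reproduced with a simple mixture calculation. The sketch below assumes that a random sequence error is equally likely to produce any of the three alternative bases, so that errors alone would show a transition:transversion ratio of 1:2; that assumption, and the function name, are not stated in the text. The observed ratio used is the 1.59:1 listed for Celera-PFP in Table 16.

```python
def error_fraction(observed_ratio, true_ratio=2.0, error_ratio=0.5):
    """Fraction of differences that must be errors to explain observed_ratio.

    Treats the observed differences as a mixture of true SNPs (with
    transition:transversion = true_ratio) and random errors (error_ratio),
    and solves for the mixing fraction using the transition proportions.
    """
    f_obs = observed_ratio / (1 + observed_ratio)
    f_true = true_ratio / (1 + true_ratio)
    f_err = error_ratio / (1 + error_ratio)
    return (f_true - f_obs) / (f_true - f_err)


print(f"{error_fraction(1.59):.1%}")
```

Under these assumptions the required error fraction comes out near 16%, in line with the roughly 15% quoted above.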

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied widely across chromosomes. In
order to normalize these values to the chromosome size and sequence
coverage, we used π, the standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of per-site heterozygosity,
quantifying the probability that a pair of chromosomes drawn from the
population will differ at a nucleotide site. In order to calculate
nucleotide diversity for each chromosome, we need to know the number of
nucleotide sites that were surveyed for variation, and in methods like
reduced representation sequencing, we need to know the sequence quality
and the depth of coverage at each site. These data are not readily
available, so we could not estimate nucleotide diversity from the TSC
effort. Estimation of nucleotide diversity from high-quality sequence
overlaps should be possible, but again, more information is needed on
the details of all the alignments.

Estimation of nucleotide diversity from a shotgun assembly entails
calculating, for each column of the multialignment, the probability
that two or more distinct alleles are present, and the probability of
detecting a SNP if in fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The greater the depth of
coverage and the higher the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even after correcting for
variation in coverage, the nucleotide diversity appeared to vary across
autosomes. The significance of this heterogeneity was tested by
analysis of variance, with estimates of π for 100-kbp windows to
estimate variability within chromosomes (for the Celera-PFP comparison,
F = 29.73, P < 0.0001).

Average diversity for the autosomes estimated from the Celera-PFP
comparison was 8.94 × 10^-4. Nucleotide diversity on the X chromosome
was 6.54 × 10^-4. The X is expected to be less variable than autosomes,
because for every four copies of autosomes in the population, there are
only three X chromosomes, and this smaller effective population size
means that random drift will more rapidly remove variation from the X
(106).
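
As a rough check on this argument, the expected ratio of X to autosomal diversity can be compared with the observed one. The sketch assumes the standard neutral expectation that diversity scales with the number of chromosome copies in the population, and it takes the autosomal value as 8.94 × 10^-4 (the exponent is inferred from the ~8 × 10^-4 genome-wide figure). The names are illustrative.

```python
def expected_x_to_autosome_ratio(x_copies=3, autosome_copies=4):
    """Neutral expectation: diversity is proportional to copy number."""
    return x_copies / autosome_copies


autosome_pi = 8.94e-4  # Celera-PFP autosomal estimate (exponent assumed)
x_pi = 6.54e-4         # Celera-PFP X-chromosome estimate

observed_ratio = x_pi / autosome_pi
expected_ratio = expected_x_to_autosome_ratio()
print(f"observed {observed_ratio:.2f}, expected {expected_ratio:.2f}")
```

The observed ratio falls close to the three-quarters expected from the copy-number argument alone.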

Having ascertained nucleotide variation genome-wide, it appears that
previous estimates of nucleotide diversity in humans based on samples
of genes were reasonably accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was 8.98 × 10^-4 for the
Celera-PFP alignment, and a published estimate averaged over 10 densely
resequenced human genes was 8.00 × 10^-4 (108).



Table 15. SNP counts for each pair of data sets (fraction of overlap in
parentheses).

SNP data set    TSC                Kwok
Celera-PFP      188,694 (0.322)    158,532 (0.362)
TSC                                72,024 (0.164)

Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   A/T (%)   C/G (%)   T/G (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6       9.2       10.3      1.59:1
Kwok*          33.7      33.8      8.5       7.0       8.6       8.4       2.07:1
TSC†           33.3      33.4      8.8       7.3       8.6       8.6       1.99:1

Fig. 13. Segmental duplications between chromosomes in the human
genome. The 24 panels show the 1077 duplicated blocks of genes,
containing 10,310 pairs of genes in total. Each line represents a pair
of homologous genes belonging to a block; all blocks contain at least
three genes on each of the chromosomes where they appear. Each panel
shows all the duplications between a single chromosome and other
chromosomes with shared blocks. The chromosome at the center of each
panel is shown as a thick red line for emphasis. Other chromosomes are
displayed from top to bottom within each panel ordered by chromosome
number. The inset (bottom, center right) shows a close-up of one
duplication between chromosomes 18 and 20, expanded to display the gene
names of 12 of the 64 gene pairs shown.

6.4 Variation in nucleotide diversity
across the human genome

Such an apparently high degree of variability among chromosomes in SNP
density raises the question of whether there is heterogeneity at a
finer scale within chromosomes, and whether this heterogeneity is
greater than expected by chance. If SNPs occur by random and
independent mutations, then it would seem that there ought to be a
Poisson distribution of numbers of SNPs in fragments of arbitrary
constant size. The observed dispersion in the distribution of SNPs in
100-kbp fragments was far greater than predicted from a Poisson
distribution (Fig. 14). However, this simplistic model ignores the
different recombination rates and population histories that exist in
different regions of the genome. Population genetics theory holds that
we can account for this variation with a mathematical formulation
called the neutral coalescent (109). Applying well-tested algorithms
for simulating the neutral coalescent with recombination (110), and
using an effective population size of 10,000 and a per-base
recombination rate equal to the mutation rate (111), we generated a
distribution of numbers of SNPs by this model as well (112). The
observed distribution of SNPs has a much larger variance than either
the Poisson model or the coalescent model, and the difference is highly
significant. This implies that there is significant variability across
the genome in SNP density, an observation that begs an explanation.
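
The Poisson comparison can be illustrated with a small simulation. Under a Poisson model the variance of the per-window SNP count equals its mean, so the variance-to-mean ratio (the index of dispersion) should be close to 1; a mixture of windows with different underlying rates pushes it well above 1. The sketch below is illustrative only: the rate of 80 SNPs per 100-kbp window is a round figure implied by a diversity of about 8 × 10^-4, the three-rate mixture is hypothetical, and the coalescent simulation used in the text is not reproduced here.

```python
import math
import random
from statistics import mean, pvariance


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson-distributed counts."""
    return pvariance(counts) / mean(counts)


def simulate_windows(rates, rng):
    """Simulate one SNP count per window, given each window's rate."""
    return [poisson_sample(lam, rng) for lam in rates]


rng = random.Random(0)
uniform_counts = simulate_windows([80.0] * 2000, rng)
mixed_rates = [rng.choice((40.0, 80.0, 120.0)) for _ in range(2000)]
mixed_counts = simulate_windows(mixed_rates, rng)

print(dispersion_index(uniform_counts), dispersion_index(mixed_counts))
```

A dispersion index far above 1 in real data, as reported for the 100-kbp fragments, is the signature of regional rate variation that a single-rate Poisson model cannot capture.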

Several attributes of the DNA sequence may affect the local density of
SNPs, including the rate at which DNA polymerase makes errors and the
efficacy of mismatch repair. One key factor that is likely to be
associated with SNP density is the G+C content, in part because
methylated cytosines in CpG dinucleotides tend to undergo deamination
to form thymine, accounting for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucleotides. We tallied the GC
content and nucleotide diversities in 100-kbp windows across the entire
genome and found that the correlation between them was positive
(r = 0.21) and highly significant (P < 0.0001), but G+C content
accounted for only a small part of the variation.
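
The statement that G+C content accounts for only a small part of the variation follows from squaring the correlation coefficient. The sketch below shows a plain Pearson correlation over per-window values together with the variance explained for r = 0.21; the function names are illustrative, and no per-window data from the study are reproduced.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def variance_explained(r):
    """Fraction of variance in one variable accounted for by the other."""
    return r * r


print(f"{variance_explained(0.21):.1%}")
```

With r = 0.21, G+C content explains only about 4% of the window-to-window variance in nucleotide diversity.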

6.5 SNPs by genomic class 

To test homogeneity of SNP densities across functional classes, we
partitioned sites into intergenic (defined as >5 kbp from any predicted
transcription unit), 5'-UTR, exonic (missense and silent), intronic,
and 3'-UTR for 10,239 known genes, derived from the NCBI RefSeq
database and all human genes predicted from the Celera Otto annotation.
In coding regions, SNPs were categorized as either silent, for those
that do not change amino acid sequence, or missense, for those that
change the protein product. The ratio of missense to silent coding SNPs
in Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78, respectively)
shows a markedly reduced frequency of missense variants compared with
the neutral expectation, consistent with the elimination by natural
selection of a fraction of the deleterious amino acid changes (112).
These ratios are comparable to the missense-to-silent ratios of 0.88
and 1.17 found by Cargill et al. (101) and by Halushka et al. (102).
Similar results were observed in SNPs derived from Celera shotgun
sequences (46).

It is striking how small is the fraction of SNPs that lead to
potentially dysfunctional alterations in proteins. In the 10,239 RefSeq
genes, missense SNPs were only about 0.12, 0.14, and 0.17% of the total
SNP counts in Celera-PFP, TSC, and Kwok SNPs, respectively.
Nonconservative protein changes constitute an even smaller fraction of
missense SNPs (47, 41, and 40% in Celera-PFP, Kwok, and TSC).
Intergenic regions have been virtually unstudied (113), and we note
that 75% of the SNPs we identified were intergenic (Table 17). The SNP
rate was highest in introns and lowest in exons. The SNP rate was lower
in intergenic regions than in introns, providing one of the first
discriminators between these two classes of DNA. These SNP rates were
confirmed in the Celera SNPs, which also exhibited a lower rate in
exons than in introns, and in extragenic regions than in introns (46).
Many of these intergenic SNPs will provide valuable information in the
form of markers for linkage and association studies, and some fraction
is likely to have a regulatory function as well.

Fig. 14. SNP density in each 100-kbp interval as determined with
Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP SNP
density; blue, coalescent model; and red, Poisson distribution. The
figure shows that the distribution of SNPs along the genome is
nonrandom and is not entirely accounted for by a coalescent model of
regional history.

7 An Overview of the Predicted
Protein-Coding Genes in the Human
Genome

Summary. This section provides an initial computational analysis of the
predicted protein set with the aim of cataloging prominent differences
and similarities when the human genome is compared with other fully
sequenced eukaryotic genomes. Over 40% of the predicted protein set in
humans cannot be ascribed a molecular function by methods that assign
proteins to known families. A protein domain-based analysis provides a
detailed catalog of the prominent differences in the human genome when
compared with the fly and worm genomes. Prominent among these are
domain expansions in proteins involved in developmental regulation and
in cellular processes such as neuronal function, hemostasis, acquired
immune response, and cytoskeletal complexity. The final enumeration of
protein families and details of protein structure will rely on
additional experimental work and comprehensive manual curation.

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary and are subject to several
limitations. Both the gene predictions and functional assignments have
been made by using computational tools, although the statistical models
in Panther, Pfam, and SMART have been built, annotated, and reviewed by
expert biologists. In the set of computationally predicted genes, we
expect both false-positive predictions (some of these may in fact be
inactive pseudogenes) and false-negative predictions (some human genes
will not be computationally predicted). We also expect errors in
delimiting the boundaries of exons and genes. Similarly, in the
automatic functional assignments, we also expect both false-positive
and false-negative predictions. The functional assignment protocol
focuses on protein families that tend to be found across several
organisms, or on families of known human genes. Therefore, we do not
assign a function to many genes that are not in large families, even if
the function is known. Unless otherwise specified, all enumeration of
the genes in any given family or functional category was taken from the
set of 26,588 predicted proteins, which were assigned functions by
using statistical score cutoffs defined for models in Panther, Pfam,
and SMART.

For this initial examination of the predicted human protein set, three
broad questions were asked: (i) What are the likely molecular functions
of the predicted gene products, and how are these proteins categorized
with current classification methods? (ii) What are the core functions
that appear to be common across the animals? (iii) How does the human
protein complement differ from that of other sequenced eukaryotes?

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the putative molecular functions of the
predicted 26,588 human proteins that have at least two lines of
supporting evidence. About 41% (12,809) of the gene products could not
be classified from this initial analysis and are termed proteins with
unknown functions. Because our automatic classification methods treat
only relatively large protein families, there are a number of
"unclassified" sequences that do, in fact, have a known or predicted
function. For the 60% of the protein set that have automatic functional
predictions, the specific protein functions have been placed into broad
classes. We focus here on molecular function (rather than higher order
cellular processes) in order to classify as many proteins as possible.
These functional predictions are based on similarity to sequences of
known function.

In our analysis of the 12,731 additional low-confidence predicted genes
(those with only one piece of supporting evidence), only 636 (5%) of
these additional putative genes were assigned molecular functions by
the automated methods. One-third of these 636 predicted genes
represented endogenous retroviral proteins, further suggesting that the
majority of these unknown-function genes are not real genes. Given that
most of these additional 12,095 genes appear to be unique among the
genomes sequenced to date, many may simply represent false-positive
gene predictions.

The most common molecular functions are the transcription factors and
those involved in nucleic acid metabolism (nucleic acid enzyme). Other
functions that are highly represented in the human genome are the
receptors, kinases, and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many proteins that are members
of proto-oncogene families, as well as families of "select regulatory
molecules": (i) proteins involved in specific steps of signal
transduction such as heterotrimeric GTP-binding proteins (G proteins)
and cell cycle regulators, and (ii) proteins that modulate the activity
of kinases, G proteins, and phosphatases.

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class    Size of region examined (Mb)    Celera-PFP SNP density (SNP/Mb)
Intergenic              2185                            707
Gene (intron + exon)    646                             917
Intron                  615                             921
First intron            164                             808
Exon                    31                              529
First exon              10                              592


Fig. 15. Distribution of the molecular functions of 26,383 human genes.
Each slice lists the numbers and percentages (in parentheses) of human
gene functions assigned to a given category of molecular function. The
outer circle shows the assignment to molecular function categories in
the Gene Ontology (GO) (179), and the inner circle shows the assignment
to Celera's Panther molecular function categories (116).

Slice labels (number, percentage):
cell adhesion (577, 1.9%)
miscellaneous (1318, 4.3%)
chaperone (159, 0.5%)
viral protein (100, 0.3%)
cytoskeletal structural protein (876, 2.8%)
transfer/carrier protein (203, 0.7%)
transcription factor (1850, 6.0%)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
hydrolase (1227, 4.0%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)
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7.2 Evolutionary conservation of core 
processes 

Because of the various "model organism" genome-sequencing projects that
have already been completed, reasonable comparative information is
available for beginning the analysis of the evolution of the human
genome. The genomes of S. cerevisiae ("bakers' yeast") (118) and two
diverse invertebrates, C. elegans (a nematode worm) (119) and
D. melanogaster (fly) (26), as well as the first plant genome,
A. thaliana, recently completed (92), provide a diverse background for
genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly,
and between human and worm (Fig. 16) to address the question, What are
the core functions that appear to be common across the animals? The
concept of orthology is important because if two genes are orthologs,
they can be traced by descent to the common ancestor of the two
organisms (an "evolutionarily conserved protein set"), and therefore
are likely to perform similar conserved functions in the different
organisms. It is critical in this analysis to separate orthologs (a
gene that appears in two organisms by descent from a common ancestor)
from paralogs (a gene that appears in more than one copy in a given
organism by a duplication event) because paralogs may subsequently
diverge in function. Following the yeast-worm ortholog comparison in
(120), we identified two different cases for each pairwise comparison
(human-fly and human-worm). The first case was a pair of genes, one
from each organism, for which there was no other close homolog in
either organism. These are straightforwardly identified as orthologous,
because there are no additional members of the families that complicate
separating orthologs from paralogs.

The second case is a family of genes with more than one member in
either or both of the organisms being compared. Chervitz et al. (120)
dealt with this case by analyzing a phylogenetic tree that described
the relationships between all of the sequences in both organisms, and
then looking for pairs of genes that were nearest neighbors in the
tree. If the nearest-neighbor pairs were from different organisms,
those genes were presumed to be orthologs. We note that these nearest
neighbors can often be confidently identified from pairwise sequence
comparison without having to examine a phylogenetic tree (see legend to
Fig. 16). If the nearest neighbors are not from different organisms,
there has been a paralogous expansion in one or both organisms after
the speciation event (and/or a gene loss by one organism). When this
one-to-one correspondence is lost, defining an ortholog becomes
ambiguous. For our initial computational overview of the predicted
human protein set, we could not answer this question for every
predicted protein. Therefore, we consider only "strict orthologs,"
i.e., the proteins with unambiguous one-to-one relationships (Fig. 16).
By these criteria, there are 2758 strict human-fly orthologs, 2031
human-worm (1523 in common between these sets). We define the
evolutionarily conserved set as those 1523 human proteins that have
strict orthologs in both D. melanogaster and C. elegans.
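
The one-to-one criterion can be sketched as a reciprocal-best-hit search over pairwise comparison results, in the spirit of the bi-directional BLAST best hits described in the legend to Fig. 16. This is a simplified illustration, not the pipeline used for the analysis: the input format (query, subject, P-value), the function names, and the identifiers are hypothetical, and the additional check against paralogs within each organism is omitted.

```python
def best_hits(hits, max_pvalue=1e-10):
    """Map each query to its most significant subject below max_pvalue.

    hits: iterable of (query, subject, pvalue) tuples.
    """
    best = {}
    for query, subject, pvalue in hits:
        if pvalue >= max_pvalue:
            continue
        if query not in best or pvalue < best[query][1]:
            best[query] = (subject, pvalue)
    return {query: subject for query, (subject, _) in best.items()}


def strict_orthologs(a_vs_b, b_vs_a, max_pvalue=1e-10):
    """Pairs (a, b) that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b, max_pvalue)
    reverse = best_hits(b_vs_a, max_pvalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def conserved_in_both(human_fly_pairs, human_worm_pairs):
    """Human proteins with a strict ortholog in both other organisms."""
    in_fly = {human for human, _ in human_fly_pairs}
    in_worm = {human for human, _ in human_worm_pairs}
    return sorted(in_fly & in_worm)
```

Applied to human-fly and human-worm comparisons, `conserved_in_both` corresponds to the intersection that the text reports as 1523 proteins.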

The distribution of the functions of the conserved protein set is shown
in Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the
set of conserved proteins is not distributed among molecular functions
in the same way as the whole human protein set. Compared with the whole
human set (Fig. 15), there are several categories that are
overrepresented in the conserved set by a factor of ~2 or more. The
first category is nucleic acid enzymes, primarily the transcriptional
machinery (notably DNA/RNA methyltransferases, DNA/RNA polymerases,
helicases, DNA ligases, DNA- and RNA-processing factors, nucleases, and
ribosomal proteins). The basic transcriptional and translational
machinery is well known to have been conserved over evolution, from
bacteria through to the most complex eukaryotes. Many
ribonucleoproteins involved in RNA splicing also appear to be conserved
among the animals. Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases, lyases, and isomerases). Many
of these enzymes are involved in intermediary metabolism. The only
exception is the hydrolase category, which is not significantly
overrepresented in the shared protein set. Proteases form the largest
part of this category, and several large protease families have
expanded in each of these three organisms after their divergence. The
category of select regulatory molecules is also overrepresented in the
conserved set. The major conserved families are small guanosine
triphosphatases (GTPases) (especially the Ras-related superfamily,
including ADP ribosylation factor) and cell cycle regulators
(particularly the cullin family, cyclin C family, and several cell
division protein kinases). The last two significantly overrepresented
categories are protein transport and trafficking, and chaperones. The
most conserved groups in these categories are proteins involved in
coated vesicle-mediated transport, and chaperones involved in protein
folding and heat-shock response [particularly the DNAJ family, and
heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These
observations provide only a conservative estimate of the protein
families in the context of specific cellular processes that were likely
derived from the last common ancestor of the human, fly, and worm. As
stated before, this analysis does not provide a complete estimate of
conservation across the three animal genomes, as paralogous duplication
makes the determination of true orthologs difficult within the members
of conserved protein families.

Fig. 16. Functions of putative orthologs across vertebrate and
invertebrate genomes. Each slice lists the number and percentages (in
parentheses) of "strict orthologs" between the human, fly, and worm
genomes involved in a given category of molecular function. "Strict
orthologs" are defined here as bi-directional BLAST best hits (180)
such that each orthologous pair (i) has a BLASTP P-value of <10^-10
(120), and (ii) has a more significant BLASTP score than any paralogs
in either organism, i.e., there has likely been no duplication
subsequent to speciation that might make the orthology ambiguous. This
measure is quite strict and is a lower bound on the number of
orthologs. By these criteria, there are 2758 strict human-fly
orthologs, and 2031 human-worm orthologs (1523 in common between these
sets).




7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 

To explore the molecular building blocks of the vertebrate taxon, we
have compared the human genome with the other sequenced eukaryotic
genomes at three levels: molecular functions, protein families, and
protein domains.

Molecular differences can be correlated with phenotypic differences to
begin to reveal the developmental and cellular processes that are
unique to the vertebrates. Tables 18 and 19 display a comparison among
all sequenced eukaryotic genomes, over selected protein/domain families
(defined by sequence similarity, e.g., the serine-threonine protein
kinases) and superfamilies (defined by shared molecular function, which
may include several sequence-related families, e.g., the cytokines). In
these tables we have focused on (super) families that are either very
large or that differ significantly in humans compared with the other
sequenced eukaryote genomes. We have found that the most prominent
human expansions are in proteins involved in (i) acquired immune
functions; (ii) neural development, structure, and functions; (iii)
intercellular and intracellular signaling pathways in development and
homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences
between the human genome and the Drosophila or C. elegans
genome is the appearance of genes involved in acquired immunity
(Tables 18 and 19). This is expected, because the acquired
immune response is a defense system that only occurs in
vertebrates. We observe 22 class I and 22 class II major
histocompatibility complex (MHC) antigen genes and 114 other
immunoglobulin genes in the human genome. In addition, there
are 59 genes in the cognate immunoglobulin receptor family. At
the domain level, this is exemplified by an expansion and
recruitment of the ancient immunoglobulin fold to constitute
molecules such as MHC, and of the integrin fold to form several
of the cell adhesion molecules that mediate interactions
between immune effector cells and the extracellular matrix.
Vertebrate-specific proteins include the paracrine immune
regulators family of secreted 4-alpha helical bundle proteins,
namely the cytokines and chemokines. Some of the cytoplasmic
signal transduction components associated with cytokine
receptor signal transduction are also features that are poorly
represented in the fly and worm. These include protein domains
found in the signal transducer and activator of transcription
(STATs), the suppressors of cytokine signaling (SOCS), and
protein inhibitors of activated STATs (PIAS). In contrast, many
of the animal-specific protein domains that play a role in
innate immune response, such as the Toll receptors, do not
appear to be significantly expanded in the human genome.
Neural development, structure, and function. In the human
genome, as compared with the worm and fly genomes, there is a
marked increase in the number of members of protein families
that are involved in neural development. Examples include
neurotrophic factors such as ependymin, nerve growth factor,
and signaling molecules such as semaphorins, as well as the
number of proteins involved directly in neural structure and
function such as myelin proteins, voltage-gated ion channels,
and synaptic proteins such as synaptotagmin. These observations
correlate well with the known phenotypic differences between
the nervous systems of these taxa, notably (i) the increase in
the number and connectivity of neurons; (ii) the increase in
number of distinct neural cell types (as many as a thousand or
more in human compared with a few hundred in fly and worm)
(121); (iii) the increased length of individual axons; and (iv)
the significant increase in glial cell number, especially the
appearance of myelinating glial cells, which are electrically
inert supporting cells differentiated from the same stem cells
as neurons. A number of prominent protein expansions are
involved in the processes of neural development. Of the
extracellular domains that mediate cell adhesion, the connexin
domain-containing proteins (122) exist only in humans. These
proteins, which are not present in the Drosophila or C. elegans
genomes, appear to provide the constitutive subunits of
intercellular channels and the structural basis for electrical
coupling. Pathway finding by axons and neuronal network
formation is mediated through a subset of ephrins and their
cognate receptor tyrosine kinases that act as positional labels
to establish topographical projections (123). The probable
biological role for the semaphorins (22 in human compared with
6 in the fly and 2 in the worm) and their receptors
(neuropilins and plexins) is that of axonal guidance molecules
(124). Signaling molecules such as neurotrophic factors and
some cytokines have been shown to regulate neuronal cell
survival, proliferation, and axon guidance (125). Notch
receptors and ligands play important roles in glial cell fate
determination and gliogenesis (126).
Other human expanded gene families play key roles directly in
neural structure and function. One example is synaptotagmin
(expanded more than twofold in humans relative to the
invertebrates), originally found to regulate synaptic
transmission by serving as a Ca2+ sensor (or receptor) during
synaptic vesicle fusion and release (127). Of interest is the
increased co-occurrence in humans of PDZ and the SH3 domains in
neuronal-specific adaptor molecules; examples include proteins
that likely modulate channel activity at synaptic junctions
(128). We also noted expansions in several ion-channel families
(Table 19), including the EAG subfamily (related to cyclic
nucleotide gated channels), the voltage-gated calcium/sodium
channel family, the inward-rectifier potassium channel family,
and the voltage-gated potassium channel, alpha subunit family.
Voltage-gated sodium and potassium channels are involved in the
generation of action potentials in neurons. Together with
voltage-gated calcium channels, they also play a key role in
coupling action potentials to neurotransmitter release, in the
development of neurites, and in short-term memory. The recent
observation of a calcium-regulated association between sodium
channels and synaptotagmin may have consequences for the
establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are
major classes of protein components in both the central and
peripheral nervous system of vertebrates. Myelin P0 is a major
component of peripheral myelin, and myelin proteolipid and
myelin oligodendrocyte glycoprotein are found in the central
nervous system. Mutations in any of these

Table 18. Domain-based comparative analysis of proteins in
H. sapiens (H), D. melanogaster (F), C. elegans (W),
S. cerevisiae (Y), and A. thaliana (A). The predicted protein
set of each of the above eukaryotic organisms was analyzed with
Pfam version 5.5 using E value cutoffs of 0.001. The number of
proteins containing the specified Pfam domains as well as the
total number of domains (in parentheses) are shown in each
column. Domains were categorized into cellular processes for
presentation. Some domains (i.e., SH2) are listed in more than
one cellular process. Results of the Pfam analysis may differ
from results obtained based on human curation of protein
families, owing to the limitations of large-scale automatic
classifications. Representative examples of domains with
reduced counts owing to the stringent E value cutoff used for
this analysis are marked with a double asterisk (**). Examples
include short divergent and predominantly alpha-helical
domains, and certain classes of cysteine-rich zinc finger
proteins.



Accession  Domain name      Domain description                              H          F         W        Y  A

Developmental and homeostatic regulators

PF02039    Adrenomedullin   Adrenomedullin                                  1          0         0        0  0
PF00212    ANP              Atrial natriuretic peptide                      2          0         0        0  0
PF00028    Cadherin         Cadherin domain                                 100 (550)  14 (157)  16 (66)  0  0
PF00214    Calc_CGRP_IAPP   Calcitonin/CGRP/IAPP family                     3          0         0        0  0
PF01110    CNTF             Ciliary neurotrophic factor                     1          0         0        0  0
PF01093    Clusterin        Clusterin                                       3          0         0        0  0
PF00029    Connexin         Connexin                                        14 (16)    0         0        0  0
PF00976    ACTH_domain      Corticotropin ACTH domain                       1          0         0        0  0
PF00473    CRF              Corticotropin-releasing factor family           2          1         0        0  0
PF00007    Cys_knot         Cystine-knot domain                             10 (11)    2         0        0  0
PF00778    DIX              Dix domain                                      5          2         4        0  0
PF00322    Endothelin       Endothelin family                               3          0         0        0  0
PF00812    Ephrin           Ephrin                                          7 (8)      2         4        0  0
PF01404    EPH_lbd          Ephrin receptor ligand binding domain           12         2         1        0  0
PF00167    FGF              Fibroblast growth factor                        23         1         1        0  0
PF01534    Frizzled         Frizzled/Smoothened family membrane region      9          7         3        0  0
PF00236    Hormone6         Glycoprotein hormones                           1          0         0        0  0
PF01153    Glypican         Glypican                                        14         2         1        0  0
PF01271    Granin           Granin (chromogranin or secretogranin)          3          0         0        0  0
PF02058    Guanylin         Guanylin precursor                              1          0         0        0  0
PF00049    Insulin          Insulin/IGF/Relaxin family                      7          4         0        0  0
PF00219    IGFBP            Insulin-like growth factor binding proteins     10         0         0        0  0
PF02024    Leptin           Leptin                                          1          0         0        0  0
PF00193    Xlink            LINK (hyaluron binding)                         13 (23)    0         1        0  0
PF00243    NGF              Nerve growth factor family                      3          0         0        0  0
PF02158    Neuregulin       Neuregulin family                               4          0         0        0  0
PF00184    Hormone5         Neurohypophysial hormones                       1          0         0        0  0
PF02070    NMU              Neuromedin U                                    1          0         0        0  0
PF00066    Notch            Notch (DSL) domain                              3 (5)      2 (4)     2 (6)    0  0
PF00865    Osteopontin      Osteopontin                                     1          0         0        0  0
PF00159    Hormone3         Pancreatic hormone peptides                     3          0         0        0  0
PF01279    Parathyroid      Parathyroid hormone family                      2          0         0        0  0
PF00123    Hormone2         Peptide hormone                                 5 (9)      0         0        0  0
PF00341    PDGF             Platelet-derived growth factor (PDGF)           5          1         0        0  0
PF01403    Sema             Sema domain                                     27 (29)    8 (10)    3 (4)    0  0
PF01033    Somatomedin_B    Somatomedin B domain                            5 (8)      3         0        0  0
PF00103    Hormone          Somatotropin                                    1          0         0        0  0
PF02208    Sorb             Sorbin homologous domain                        2          0         0        0  0
PF02404    SCF              Stem cell factor                                2          0         0        0  0
PF01034    Syndecan         Syndecan domain                                 3          1         1        0  0
PF00020    TNFR_c6          TNFR/NGFR cysteine-rich region                  17 (31)    1         0        0  0
PF00019    TGF-beta         Transforming growth factor β-like domain        27 (28)    6         4        0  0
PF01099    Uteroglobin      Uteroglobin family                              3          0         0        0  0
PF01160    Opiods_neuropep  Vertebrate endogenous opioids neuropeptide      3          0         0        0  0
PF00110    Wnt              Wnt family of developmental signaling proteins  18         7 (10)    5        0  0

Hemostasis

PF01821    ANATO            Anaphylotoxin-like domain                       6 (14)     0         0        0  0
PF00386    C1q              C1q domain                                      24         0         0        0  0
PF00200    Disintegrin      Disintegrin                                     18         2         3        0  0
PF00754    F5_F8_type_C     F5/8 type C domain                              15 (20)    5 (6)     2        0  0
PF01410    COLFI            Fibrillar collagen C-terminal domain            10         0         0        0  0
PF00039    Fn1              Fibronectin type I domain                       5 (18)     0         0        0  0
PF00040    Fn2              Fibronectin type II domain                      11 (16)    0         0        0  0
PF00051    Kringle          Kringle domain                                  15 (24)    2         2        0  0
PF01823    MACPF            MAC/Perforin domain                             6          0         0        0  0
PF00354    Pentaxin         Pentaxin family                                 9          0         0        0  0
PF00277    SAA_proteins     Serum amyloid A protein                         4          0         0        0  0
PF00084    Sushi            Sushi domain (SCR repeat)                       53 (191)   11 (42)   8 (45)   0  0
PF02210    TSPN             Thrombospondin N-terminal-like domains          14         1         0        0  0
PF01108    Tissue_fac       Tissue factor                                   1          0         0        0  0
PF00868    Transglutamin_N  Transglutaminase family                         6          1         0        0  0
PF00927    Transglutamin_C  Transglutaminase family                         8          1         0        0  0
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Accession 
number 



Domain name 



Domain description 



H 



PF00594 


G(a 


PF00711 


Defensin_beta 


PF00748 


Calpainjnhib 


PF00666 


Cathelicidins 


PF00129 

• 


MHCJ 


PF00993 


MHCJLalpha** 


PF00969 


MHCJI.beta** 


PF00879 


Defensin__propep 


PF01109 


GM_CSF 


PF00047 


lg " 

o 


PF00143 


Interferon 


PF00714 


IFN-gamma 


PF00726 


IL10 


PF02372 


IL15 


PF00715 


IL2 


PF00727 


IL4 


PF02025 


IL5 


PF01415 


IL7 


PF00340 


IL1 


PF02394 


IL1_propep 


PF02059 


IL3 


PF00489 


IL6 


PF01291 


LIF.OSM 


PF00323 


Defensins 


PF01091 


PTN_MK 


PF00277 


SAA_proteins 


PF00048 


IL8 



PF01582 



TIR 



PF00229 


. TNF 


PF00088 


Trefoil 


PF00779 


BTK 


PF00168 


C2 


PF00609 


DAGKa 


PF00781 


DAGKc 


PF00610 


DEP 


PF01363 


FYVE 


PF00996 


GDI 


PF00503 


C-alpha 


PF00631 


G-gamma 


PF00616 


RasGAP 


PF00618 


RasGEFN 


PF00625 


Guanylate kin 


PF02189 


ITAM 


PF00169 


PH 


PF00130 


DAGJ>E-bind 


PF00388 


PI-PLC-X 


PF00387 


PI-PLC-Y 


PF00640 


PID 


PF02192 


P)3K_p85B 


PF00794 


PI3K_rbd 


PF01412 


ArfGAP 


PF02196 


RBD 


PF02145 


Rap.GAP 


PF00788 


RA 


PF00071 


Ras 


PF00617 


RasGEF 


PF00615 


RGS 


PF02197 


Rlla 



Vitamin K-dependent carboxylation/gamma- 
carboxyglutamic (GLA) domain 

Immune response 

Beta defensin 
Catpain inhibitor repeat - 
Cathelicidins ... 
Class I histocompatibility antigen, domains alpha 1 
• and 2 

Class II histocompatibility antigen, alpha domain 
Class II histocompatibility antigen, beta domain 
Defensin propeptide 

Granulocyte-macrophage colony-stimulating factor 

Immunoglobulin domain 

Interferon alpha/beta domain 

Interferon gamma 

lnterleuktn-10 

lnterleukin-15 

lnterleukin-2 

lnterleukin-4 

lnterleukin-5 

lnterleukin-7/9 family 

lnterleukin-1 

lnterleukin-1 propeptide 

lnterleukin-3 

lnterleukin-6/G-CSF/MGF family 

Leukemia inhibitory factor (UF)/oncostatin (OSM) 

family 
Mammalian defensin 
PTN/MK heparin-binding protein 
Serum amyloid A protein 
Small cytokines (intecrine/chemokine), 

interleukin-8 like 
TIR domain 

TNF (tumor necrosis factor) family . 
Trefoil (P-type) domain 

Pi-PY-rho CTPase signaling 

BTK motif 
C2 domain 

Diacylglycerol kinase accessory domain (presumed) 
Diacylglycerol kinase catalytic domain (presumed) 
Domain found in Dishevelled, Egl-10, and 

Pleckstrin (DEP) 
FYVE zinc finger 
GDP dissociation inhibitor 
. G-protein alpha subunit 
G-protein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Ras-like 

GTPases; N-terminal motif 
Guanylate kinase 

Immunoreceptor tyrosine-based activation motif 
PH domain 

Phorbol esters/diacylglycerol binding domain (C1 
domain) 

Phosphatidylinositol-specific phospholipase C, X 
domain 

Phosphatidylinositol-specific phospholipase C, Y 
domain 

Phosphotyrosine interaction domain (PTB/PID) 
PI3-kinase family. p85-binding domain 
PI3-kinase family, ras-binding domain 
Putative GTP-ase activating protein for Arf 
Raf-like Ras-binding domain 
Rap/ran-GAP 

Ras association (RalGDS/AF-6) domain 
Ras family 
RasGEF domain 

Regulator of G protein signaling domain 
Regulatory subunit of type II PKA R-subunit 



11 



1 

3(9) 
2 

'18 (20) 

5(6) 
7 
3 
1 

381 (930J 
7(9) 
1 
1 
1 
1 
1 
1 
1 
7 



0 
0 
0 

0 
0 
0 
0 

125 (291) 
0 
0 
0 
0 
0 
0 
0 
0 
0 



1 


0 


1 


0 


2 


0 


2 

» 


0 


z 


0 


2 


0 


4 


0 


32 


0 


18 


• 

8 


■'15 - 


0. 




V 


5 


1 




OJ (AA\ 


9 


4 


10 


8 


12(13) 


4 


28 (30) 


14 


6 


2 


27(30) 


■10 


16 


5 


11 


5 


9 


2 


12 


8 


3 


0 


33(212) 


72 (78) 


45 (56) 


25(31) 


12 


3 


11 


2 


24 (27) 


13 


2 


1 


6 


3 


16 


9 


6(7) 


4 


5 


4 


18(19) 


7(9) 


126 


56(57) 


21 


8 


27 


6(7) 


4 


1 
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Accession 
number 



Domain name 



PF00620 
PF00621 
PF00536 
. PF01369 
PF00017 
PF00018 
PF01017 
PF00790 
PF00568 

PF00452 
PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 

PF00022 

PF00191 

PF00402 

PF00373 

PF00880 

PF00681 

PF00435 

PF00418 

PF00992 

PF02209 

PF01044 



RhoGAP 

RhoCEF 

SAM 

Sec7 

SH2 

SH3 

STAT 

VHS 

WH1 

Bd-2 

BH4 

CARD 

Death 

DED 

BAG 

ICE_p20 

BIR 

Actin 

Annexin 

Calponin 

Band_41 

Nebulin_repeat 

Plectin_repeat 

Spectrin 

Tubulin-binding 

Troponin 

VHP 

Vinculin 



PF01391 


Collagen 


PF01413 


C4 


PF00431 


CUB 


PF00008 


EGF 


PF00147 


Fibrinogen_C 


PF00041 


Fn3 


PF00757 


Furin-like 


PF00357 


lntegrin_A 


PF00362 


Integrinji 


PF0OO52 


. Lamininji 


PFfX)053 


Laminin_EGF 


PF00054 


Laminin_C 


PF00055 


Laminin_Nterm 


PF00059 


Lectin_c 


PF01463 


LRRCT 


PF01462 


LRRNT 


PF00057 . 


Ldl_recept_a 


PF00058 


Ldl_receptJ> 


PF00530 


SRCR 


PF00084 


Sushi 


PF00090 


Tsp_1 


PF00092 


Vwa 


PF00093 


Vwc 


PF00094 


Vwd 



Domain description 



H 



W 



PF00244 

PF00023 

PF00514 

PF00168 

PF00027 

PF01556 

PF00226 

PF0O036 

PF00611 

PF01846 

PF00498 



14-3-3 
Ank 

Armadillo_seg 
C2 

cNMP_binding 

DnaJJT 

DnaJ 

Efhand** 

FCH 

FF 

FHA 



RhoGAP domain 
RhoGEF domain 

SAM domain (Sterile alpha motif) 
Sec7 domain . 

. Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WH1 domain 

Domains involved in apoptosis 
Bd-2 . ^ 

Bd-2 homology region 4 

. Caspase recruitment domain 

Death domain 

Death effector domain 

Domain present in Hsp70 regulators 

ICE-like protease (caspase) p20 domain 

Inhibitor of Apoptosis domain 

. Cytoskeletat 

Actin 
Annexin 

Calponin family 

FERM domain (Band 4.1 family) 
Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

; . BCM adhesion 

Collagen ir'tpie helix repeat (20 copies) 
C-terminal tandem repeated domain in type 4 

procollagen 
CUB domain 
EGF-like domain 

Fibrinogen beta and gamma chains, C-terminal 
globular domain 

Fibronectin type HI domain 

Furin-like cysteine rich region 

Integrin alpha cytoplasmic region 

Integrins, beta chain 

Laminin B (Domain IV) 

Laminin EGF-like (Domains (H and V) 

Laminin G domain 

Laminin N-terminal (Domain VI) 

Lectin C-type domain 

Leucine rich repeat C-terminal domain 

Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A . 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D. domain 

Protein interaction domains 

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain . 

Cydic nucleotide-binding domain 
DnaJ C terminal region 
DnaJ domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 





19 


20 


9 - 


8 


AC 


- 23 (24) 


18(19) 


3 


0 


29 (31) 


15 


8 


3 


6 


13 


5 


5 


5 


9 


07 tnc\ 

87(95) 


33 (39) 


44(48) 


1 


3 


143(182) 


55 (75) 


. 46 (61) 


23 (27) 


4 


7 


1 


1(2) 


0 


0 


4 


C 


4 


4 


8 


7 

9 




Z(3J 


1 


0 


9 


2 


1 


0 


. 0 


"3 


0 


1 


. o . 


0 


16 


0 


2 


0 


0 


16 


5 


7 


0 


0 


4pJ 


0 


0 


0 


0 


5(8) 


3 


2 


1 


5 


n 


7 


3 


0 


0 


ft (1A\ 


5(9) 


2(3) 


1(2) 


0 


61 (64) 


15(16) 


12 


9(11) 


24 


16(55) 


4(16) 


4(11) 


0 


6(16) 


13(22) 


3 


7(19) 


0 


0 


. 29(30) 


17(19) 


11(14) 


0 


0 


4(148) 


1(2) 


1 


0 


0 


2(11) 


0 


0 


0 


0 


31 (195) 


13(171) 


10(93) 


0 


0 


4(12) 


1(4) 
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myelin proteins result in severe demyelina- 
tion, which is a pathological condition in 
which the myelin is lost and the nerve con- 
duction is severely impaired (130). Humans 
have at least 10 genes belonging to four different families
involved in myelin production (five myelin P0, three myelin
proteolipid, myelin basic protein, and myelin-oligodendrocyte
glycoprotein, or MOG), and possibly more remotely related
members of the MOG family. Flies have only a single myelin
proteolipid, and worms have none at all.

Table 18 (Continued)
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FKBP-type peptidyl-prolyl cis-trans isomerases 
GAF domain 
Kelch motif 
Leucine Rich Repeat 
MATH domain 
PAS domain 

PDZ domain (Also known as DHR or GLGF) 
PH domain 
PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ-Zinc finger present in dystrophin, CBP/p300 

Nuctear interaction domains 

A20-like zinc finger 
ARID DNA binding domain 
BAH domain 
B-box zinc finger 

BRCA1 C Terminus (BRCT) domain 
Bromodomain 
BTB/POZ domain 

C-5 cytosine-specific DNA methytase 
chromo* (CHRromatin Organization Modifier) 
domain 

Core histone H2A/H2B/H3/H4 
Cyclin 

DEAD/DEAH box helicase 
DHHC zinc finger domain 
F-box domain 
Fork head domain 
GATA zinc- finger 
G-patch domain 

Helix-loop-helix DNA-binding domain 
Histone deacetylase family 
Homeobox domain . 
IPT/TIG domain 
JmjC domain 
JmjN domain 
KH domain 
KRAB box 

Ligand-binding domain of nuclear hormone 

receptor 
UM domain containing proteins 
MATH domain 

Myb-like DNA-binding domain 
Myc leucine zipper domain 
MYND finger 
PHD-finger * 

Pou domain — N-terminal to homeobox domain 
RFX DNA-binding domain 
RNA recognition motif (a.Jca. RRM, RBD, or RNP 

domain) 
SAP domain 
SPRY domain 
START domain 
T-box 



15(20) 
7(8) 
54(157) 
25 (30) 
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18(19) 
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10 
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28(67) 
204 (243) 
47 

62(129) 
11 

32 (43) 
1 
14 
68(86) 
15 
7 

224 (324) 
15 

44(51) 
10 
17(19) 



Intercellular and intracellular signaling pathways in
development and homeostasis. Many protein families that have
expanded in humans relative to the invertebrates are involved
in signaling processes, particularly in response to development
and differentiation
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Accession 
number 



Domain name 



Domain description 



H 
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PF0O352 

PF00567 
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PF00098 



Zf-TAZ 
TEA 

Zf-TRAF 
TBP. 

TUDOR 

Zf-CCCH 

Zf-C2H2** 
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Transcription factor TFIID (or TATA-binding 

protein, TBP) 
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Zinc finger, C2H2 type 

Zinc finger, C3HC4 type (RING finger) 

Zinc knuckle

(Tables 18 and 19). They include secreted hormones and growth
factors, receptors, intracellular signaling molecules, and
transcription factors.

Developmental signaling molecules that are enriched in the
human genome include growth factors such as wnt, transforming
growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve
growth factor, platelet derived growth factor (PDGF), and
ephrins. These growth factors affect tissue differentiation and
a wide range of cellular processes involving actin-cytoskeletal
and nuclear regulation. The corresponding receptors of these
developmental ligands are also expanded in humans. For example,
our analysis suggests at least 8 human ephrin genes (2 in the
fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in
the worm). In the wnt signaling pathway, we find 18 wnt family
genes (6 in the fly, 5 in the worm) and 12 frizzled receptors
(6 in the fly, 5 in the worm). The Groucho family of
transcriptional corepressors downstream in the wnt pathway are
even more markedly expanded, with 13 predicted members in
humans (2 in the fly, 1 in the worm).
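
The counts quoted in this paragraph can be turned into fold expansions with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the gene counts are copied from the text above, while the function names `fold_expansion` and `summarize` are hypothetical and introduced only for illustration.

```python
# Gene counts quoted in the text: (human, fly, worm).
COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Ratio of the human count to another species' count.

    Returns None when the other species has no members, because a
    ratio is undefined in that case.
    """
    if other == 0:
        return None
    return human / other


def summarize(counts):
    """Map each family to (fold vs fly, fold vs worm)."""
    return {
        family: (fold_expansion(h, f), fold_expansion(h, w))
        for family, (h, f, w) in counts.items()
    }


if __name__ == "__main__":
    for family, (vs_fly, vs_worm) in summarize(COUNTS).items():
        print(f"{family:16s} {vs_fly:5.1f}x vs fly  {vs_worm:5.1f}x vs worm")
```

On these numbers the Groucho family shows the largest ratios (6.5-fold over the fly and 13-fold over the worm), which matches the statement that it is "even more markedly expanded".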

Extracellular adhesion molecules involved in signaling are
expanded in the human genome (Tables 18 and 19). The
interactions of several of these adhesion domains with
extracellular matrix proteoglycans play a critical role in host
defense, morphogenesis, and tissue repair (131). Consistent
with the well-defined role of heparan sulfate proteoglycans in
modulating these interactions (132), we observe an expansion of
the heparin sulfate sulfotransferases in the human genome
relative to worm and fly. These sulfotransferases modulate
tissue differentiation (133). A similar expansion in humans is
noted in structural proteins that constitute the
actin-cytoskeletal architecture. Compared with the fly and
worm, we observe an explosive expansion of the nebulin (35
domains per protein on average), aggrecan (12 domains per
protein on average), and plectin (5 domains per protein on
average) repeats in humans. These repeats are present in
proteins involved in modulating the actin-cytoskeleton with
predominant expression in neuronal, muscle, and vascular
tissues.



Comparison across the five sequenced eukaryotic organisms
revealed several expanded protein families and domains involved
in cytoplasmic signal transduction (Table 18). In particular,
signal transduction pathways playing roles in developmental
regulation and acquired immunity were substantially enriched.
There is a factor of 2 or greater expansion in humans in the
Ras superfamily GTPases and the GTPase activator and GTP
exchange factors associated with them. Although there are about
the same number of tyrosine kinases in the human and C. elegans
genomes, in humans there is an increase in the SH2, PTB, and
ITAM domains involved in phosphotyrosine signal transduction.
Further, there is a twofold expansion of phosphodiesterases in
the human genome compared with either the worm or fly genomes.
The downstream effectors of the intracellular signaling
molecules include the transcription factors that transduce
developmental fates. Significant expansions are noted in the
ligand-binding nuclear hormone receptor class of transcription
factors compared with the fly genome, although not to the
extent observed in the worm (Tables 18 and 19). Perhaps the
most striking expansion in humans is in the C2H2 zinc finger
transcription factors. Pfam detects a total of 4500 C2H2 zinc
finger domains in 564 human proteins, compared with 771 in 234
fly proteins. This means that there has been a dramatic
expansion not only in the number of C2H2 transcription factors,
but also in the number of these DNA-binding motifs per
transcription factor (8 on average in humans, 3.3 on average in
the fly, and 2.3 on average in the worm). Furthermore, many of
these transcription factors contain either the KRAB or SCAN
domains, which are not found in the fly or worm genomes. These
domains are involved in the oligomerization of transcription
factors and increase the combinatorial partnering of these
factors. In general, most of the transcription factor domains
are shared between the three animal genomes, but the
reassortment of these domains results in organism-specific
transcription factor families. The domain combinations found in
the human, fly, and worm include the BTB with C2H2 in the fly
and humans, and homeodomains alone or in combination with Pou
and LEM domains in all of the animal genomes. In plants,
however, a different set of transcription factors are expanded,
namely, the myb family, and a unique set that includes VP1 and
AP2 domain-containing proteins (134). The yeast genome has a
paucity of transcription factors compared with the
multicellular eukaryotes, and its repertoire is limited to the
expansion of the yeast-specific C6 transcription factor family
involved in metabolic regulation.
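
The per-protein averages quoted above for C2H2 zinc fingers follow directly from the Pfam totals given in the same paragraph. As a small worked check (a sketch only; the helper name `mean_domains_per_protein` is hypothetical, and the worm totals are not given in the text, so only human and fly are computed):

```python
def mean_domains_per_protein(total_domains, n_proteins):
    """Average number of domain instances per domain-containing protein."""
    if n_proteins <= 0:
        raise ValueError("n_proteins must be positive")
    return total_domains / n_proteins


# Totals quoted in the text: (domains detected, proteins containing them).
human = mean_domains_per_protein(4500, 564)
fly = mean_domains_per_protein(771, 234)

print(f"human: {human:.1f} C2H2 domains per protein")
print(f"fly:   {fly:.1f} C2H2 domains per protein")
print(f"protein-count ratio (human/fly): {564 / 234:.1f}")
```

The two ratios separate the two effects described in the text: the number of C2H2 proteins is about 2.4 times higher in humans than in the fly, and each human protein also carries more than twice as many motifs on average.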

While we have illustrated expansions in a subset of signal
transduction molecules in the human genome compared with the
other eukaryotic genomes, it should be noted that most of the
protein domains are highly conserved. An interesting
observation is that worms and humans have approximately the
same number of both tyrosine kinases and serine/threonine
kinases (Table 19). It is important to note, however, that
these are merely counts of the catalytic domain; the proteins
that contain these domains also display a wide repertoire of
interaction domains with significant combinatorial diversity.
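
The distinction drawn here, between counting proteins and counting domain occurrences, is the same one encoded in the "N (M)" entries of Table 18 (proteins containing a domain, with the total number of domains in parentheses). The sketch below shows one way such a tally could be computed from a list of domain hits. It is an illustration under stated assumptions, not the authors' pipeline: the hit tuples are invented example data, the function names are hypothetical, and treating the 0.001 cutoff as "keep hits with E value at or below the cutoff" is an assumption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # cutoff stated in the Table 18 caption


def count_domains(hits, cutoff=E_VALUE_CUTOFF):
    """Tally domain hits that pass the E-value cutoff.

    hits: iterable of (protein_id, domain_name, e_value) tuples.
    Returns {domain_name: (n_proteins, n_domain_instances)}.
    """
    proteins = defaultdict(set)
    instances = defaultdict(int)
    for protein_id, domain, e_value in hits:
        if e_value > cutoff:  # assumed: hits at or below the cutoff are kept
            continue
        proteins[domain].add(protein_id)
        instances[domain] += 1
    return {d: (len(proteins[d]), instances[d]) for d in instances}


def format_cell(n_proteins, n_instances):
    """Render a table cell as 'N' or 'N (M)' when the totals differ."""
    if n_instances == n_proteins:
        return str(n_proteins)
    return f"{n_proteins} ({n_instances})"


# Invented example hits, for illustration only.
example_hits = [
    ("protA", "SH2", 1e-20),
    ("protA", "SH2", 1e-5),
    ("protB", "SH2", 1e-8),
    ("protC", "SH2", 0.5),   # fails the cutoff and is ignored
    ("protB", "SH3", 1e-4),
]

if __name__ == "__main__":
    for domain, (n_prot, n_inst) in sorted(count_domains(example_hits).items()):
        print(domain, format_cell(n_prot, n_inst))
```

A stringent cutoff drops weak hits such as the third SH2 example, which is the effect the Table 18 caption flags with a double asterisk for short or divergent domains.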

Hemostasis. Hemostasis is regulated primarily by plasma
proteases of the coagulation pathway and by the interactions
that occur between the vascular endothelium and platelets.
Consistent with known anatomical and physiological differences
between vertebrates and invertebrates, extracellular adhesion
domains that constitute proteins integral to hemostasis are
expanded in the human relative to the fly and worm (Tables 18
and 19). We note the evolution of domains such as FIMAC, FN1,
FN2, and C1q that mediate surface interactions between
hematopoietic cells and the vascular matrix. In addition, there
has been extensive recruitment of more-ancient animal-specific
domains such as VWA, VWC, VWD, kringle, and FN3 into
multidomain proteins that are involved in hemostatic
regulation. Although we do not find a large expansion in the
total number of serine proteases, this enzymatic domain has
been specifically recruited into several of these multidomain
proteins for proteolytic regulation in the vascular
compartment. These are represented in plasma proteins that
belong to the kinin and complement pathways. There is a
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and metalloprotease) and
MMPs (matrix metalloproteases) (Table 19). Proteolysis of
extracellular matrix (ECM) proteins is critical for tissue
development and for tissue degradation in diseases such as
cancer, arthritis, Alzheimer's disease, and a variety of
inflammatory conditions (135, 136). ADAMs are a family of
integral membrane proteins with a pivotal role in
fibrinogenolysis and modulating interactions between
hematopoietic components and the vascular matrix components.
These proteins have been shown to cleave matrix proteins, and
even signaling molecules: ADAM-17 converts tumor necrosis
factor-α, and ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified 19 members of the
matrix metalloprotease family, and a total of 51 members of the
ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic
pathway components across eukarya is consistent with its
central role in developmental regulation and as a response to
pathogens and stress signals. The signal transduction pathways
involved in programmed cell death, or apoptosis, are mediated
by interactions between well-characterized domains that include
extracellular domains, adaptor (protein-protein interaction)
domains, and those found in effector and regulatory enzymes
(137). We enumerated the protein counts of central adaptor and
effector enzyme domains that are found only in the apoptotic
pathways to provide an estimate of divergence across eukarya
and relative expansion in the human genome when compared with
the fly and worm (Table 18). Adaptor domains found in proteins
restricted only to apoptotic regulation such as the DED domains
are vertebrate-specific, whereas others like BIR, CARD, and
Bcl2 are represented in the fly and worm (although the number
of Bcl2 family members in humans is significantly expanded).
Although plants and yeast lack the caspases, caspase-like
molecules, namely the para- and meta-caspases, have been
reported in these organisms (138). Compared with other animal
genomes, the human genome shows an expansion in the adaptor and
effector domain-containing proteins involved in apoptosis, as
well as in the proteases involved in the cascade such as the
caspase and calpain families.

Expansions of other protein families. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the lip- 
oxygenase-activating proteins (four in humans) 
may be vertebrate-specific. Lipoxygenases are 
involved in arachidonic acid metabolism, and 
they and their activators have been implicated
in diverse human pathology ranging from
allergic responses to cancers. One of the most
surprising human expansions, however, is in
the number of glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) genes (46 in humans,
3 in the fly, and 4 in the worm). There
is, however, evidence for many retrotransposed
GAPDH pseudogenes (139), which
may account for this apparent expansion.
However, it is interesting that GAPDH, long
known as a conserved enzyme involved in
basic metabolism found across all phyla from
bacteria to humans, has recently been shown
to have other functions. It has a second
catalytic activity, as a uracil DNA glycosylase
(140), functions as a cell cycle regulator
(141), and has even been implicated in
apoptosis (142).

Translation. Another striking set of human
expansions has occurred in certain families
involved in the translational machinery.
We identified 28 different ribosomal subunits
that each have at least 10 copies in the
genome; on average, for all ribosomal proteins
there is about an 8- to 10-fold expansion in
the number of genes relative to either the
worm or fly. Retrotransposed pseudogenes
may account for many of these expansions
[see the discussion above and (143)]. Recent
evidence suggests that a number of ribosomal
proteins have secondary functions independent
of their involvement in protein biosynthesis;
for example, L13a and the related L7
subunits (36 copies in humans) have been
shown to induce apoptosis (144).

There is also a four- to fivefold expansion
in the elongation factor 1-alpha family
(eEF1A; 56 human genes). Many of these
expansions likely represent intronless paralogs
that have presumably arisen from
retrotransposition, and again there is evidence that
many of these may be pseudogenes (145).
However, a second form (eEF1A2) of this
factor has been identified with tissue-specific
expression in skeletal muscle and a
complementary expression pattern to the
ubiquitously expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing
results in multiple transcripts from a single
gene, and can therefore generate additional
diversity in an organism's protein complement.
We have identified 269 genes for
ribonucleoproteins. This represents over 2.5
times the number of ribonucleoprotein genes
in the worm, two times that of the fly, and
about the same as the 265 identified in the
Arabidopsis genome. Whether the diversity
of ribonucleoprotein genes in humans
contributes to gene regulation at either the
splicing or translational level is unknown.

Panther family/subfamily*                      H     F     W     Y     A

MHC class I                                   22     0     0     0     0
MHC class II                                  20     0     0     0     0
Other immunoglobulin†                        114     0     0     0     0
Toll receptor-related                         10     6     0     0     0

Developmental and homeostatic regulators
Signaling molecules†
  Calcitonin                                   3     0     0     0     0
  Ephrin                                       8     2     4     0     0
  FGF                                         24     1     1     0     0
  Glucagon                                     4     0     0     0     0
  Glycoprotein hormone beta chain              2     0     0     0     0
  Insulin                                      1     0     0     0     0
  Insulin-like hormone                         3     0     0     0     0
  Nerve growth factor                          3     0     0     0     0
  Neuregulin/heregulin                         6     0     0     0     0
  Neuropeptide Y                               4     0     0     0     0
  PDGF                                         1     1     0     0     0
  Relaxin                                      3     0     0     0     0
  Stanniocalcin                                2     0     0     0     0
  Thymopoietin                                 2     0     1     0     0
  Thymosin beta                                4     2     0     0     0
  TGF-β                                       29     6     4     0     0
  VEGF                                         4     0     0     0     0
  Wnt                                         18     6     5     0     0
Receptors†
  Ephrin receptor                             12     2     1     0     0
  FGF receptor                                 4     4     0     0     0
  Frizzled receptor                           12     6     5     0     0
  Parathyroid hormone receptor                 2     0     0     0     0
  VEGF receptor                                5     0     0     0     0
  BDNF/NT-3 nerve growth factor receptor       4     0     0     0     0

Kinases and phosphatases
Dual-specificity protein phosphatase          29     8    10     4    11
S/T and dual-specificity protein kinase†     395   198   315   114  1102
S/T protein phosphatase                       15    19    51    13    29
Y protein kinase†                            106    47   100     5    16
Y protein phosphatase                         56    22    95     5     6

Signal transduction
ARF family                                    55    29    27    12    45
Cyclic nucleotide phosphodiesterase           25     8     6     1     0
G protein-coupled receptors†‡                616   146   284     0     1
G-protein alpha                               27    10    22     2     5
G-protein beta                                 5     3     2     1     1
G-protein gamma                               13     2     2     0     0
Ras superfamily                              141    64    62    26    86
G-protein modulators†
  ARF GTPase-activating                       20     8     9     5    15
  Neurofibromin                                7     2     0     2     0
  Ras GTPase-activating                        9     3     8     1     0
  Tuberin                                      7     3     2     0     0
  Vav proto-oncogene family                   35    15    13     3     0

Posttranslational modifications. In this 
set of processes, the most prominent expan- 
sion is the transglutaminases, calcium-depen- 
dent enzymes that catalyze the cross-linking 
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K-dependent gamma carboxylase gene prod- 
uct acts on the GLA domain (missing in the 
fly and worm) found in coagulation factors, 
osteocalcin, and matrix GLA protein (148).
Tyrosylprotein sulfotransferases participate
in the posttranslational modification of pro- 
teins involved in inflammation and hemosta- 
sis, including coagulation factors and chemo- 
kine receptors (149). Although there is no 
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not found in the other currently se- 
quenced genomes. These include the tandem 
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains— BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the prominent differences in
the immune system, hemostasis, neuronal, 
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive
domain shuffling to increase or alter
combinatorial diversity can provide an exponential
increase in the ability to mediate
protein-protein interactions without dramatically
increasing the absolute size of the protein
complement (150). Evolution of apparently new
(from the perspective of sequence analysis)
protein domains and increasing regulatory
complexity by domain accretion both
quantitatively and qualitatively (recruitment of
novel domains with preexisting ones) are two
features that we observe in humans. Perhaps 
the best illustration of this trend is the C2H2 
zinc finger-containing transcription factors 
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal ribosomal entry sites in the human 
genome to regulate translation of specific 
classes of proteins suggests that this is an area 
that needs further research to identify the full 
extent of this process in the human genome 
(151). At the posttranslational level, although 
we provide examples of expansions of some
of the protein families involved in these
modifications, further experimental evidence is
required to evaluate whether this is correlated
with increased complexity in protein process- 
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level.

Panther family/subfamily*                      H     F     W     Y     A

Transcription factors/chromatin organization
C2H2 zinc finger-containing†                 607   232    79    28     8
[row label missing]                            7     1     1     0     0
CREB                                           7     1     2     0     0
ETS-related                                   25     8    10     0     0
Forkhead-related                              34    19    15     4     0
[row label missing]                            8     2     1     0     0
Groucho                                       13     2     1     0     0
Histone H1                                     5     0     1     0     0
Histone H2A                                   24     1    17     3    13
Histone H2B                                   21     1    17     2    12
Histone H3                                    28     2    24     2    16
Histone H4                                     9     1    16     1     8
Homeotic†                                    168   104    74     4    78
ABD-B                                          5     0     0     0     0
Bithoraxoid                                    1     8     1     0     0
Iroquois class                                 7     3     1     0     0
Distal-less                                    5     2     1     0     0
Engrailed                                      2     2     1     0     0
LIM-containing                                17     8     3     0     0
MEIS/KNOX class                                9     4     4     2    26
NK-3/NK-2 class                                9     4     5     0     0
Paired box                                    38    28    23     0     2
Six                                            5     3     4     0     0
Leucine zipper                                 6     0     0     0     0
Nuclear hormone receptor†                     59    25   183     1     4
Pou-related                                   15     5     4     1     0
Runt-related                                   3     4     2     0     0

ECM adhesion
Cadherin                                     113    17    16     0     0
Claudin                                       20     0     0     0     0
Complement receptor-related                   22     8     6     0     0
Connexin                                      14     0     0     0     0
Galectin                                      12     5    22     0     0
Glypican                                      13     2     1     0     0
ICAM                                           6     0     0     0     0
Integrin alpha                                24     7     4     0     1
Integrin beta                                  9     2     2     0     0
LDL receptor family                           26    19    20     0     2
Proteoglycans                                 22     9     7     0     5

Apoptosis
Bcl-2                                         12     1     0     0     0
Calpain                                       22     4    11     1     3
Calpain inhibitor                              4     0     0     0     1
Caspase                                       13     7     3     0     0

Hemostasis
ADAM/ADAMTS                                   51     9    12     0     0
Fibronectin                                    3     0     0     0     0
Globin                                        10     2     3     0     3
Matrix metalloprotease                        19     2     7     0     3
Serum amyloid A                                4     0     0     0     0
Serum amyloid P (subfamily of Pentaxin)        2     0     0     0     0
Serum paraoxonase/arylesterase                 4     0     3     0     0
Serum albumin                                  4     0     0     0     0
Transglutaminase                              10     1     0     0     0

Other enzymes
Cytochrome p450                               60    89    83     3   256
GAPDH                                         46     3     4     3     8
Heparan sulfotransferase                      11     4     2     0     0

Splicing and translation
EF1 alpha                                     56    13    10     6    13
Ribonucleoproteins†                          269   135   104    60   265
Ribosomal proteins†                          812   111    80   117   256

†Represents a number of different families in the same Panther molecular function.
‡This count includes rhodopsin-class, secretin-class, and metabotropic glutamate-class GPCRs.

8 Conclusions

8.1 The whole-genome sequencing
approach versus BAC by BAC

Experience in applying the whole-genome
shotgun sequencing approach to a diverse
group of organisms with a wide range of
genome sizes and repeat content allows us to
assess its strengths and weaknesses. With the
success of the method for a large number of
microbial genomes, Drosophila, and now the
human, there can be no doubt concerning the
utility of this method. The large number of
microbial genomes that have been sequenced
by this method (15, 80, 152) demonstrates that
megabase-sized genomes can be sequenced
efficiently without any input other than the de
novo mate-paired sequences. With more
complex genomes like those of Drosophila or
human, map information, in the form of
well-ordered markers, has been critical for
long-range ordering of scaffolds. For joining
scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequencing,
the prior existence of mapping data was
beneficial. During the sequencing of the A.
thaliana genome, sequencing of individual
BAC clones permitted extension of the
sequence well into centromeric regions and
allowed high-quality resolution of complex
repeat regions. Likewise, in Drosophila, the
BAC physical map was most useful in regions
near the highly repetitive centromeres
and telomeres. WGA has been found to
deliver excellent-quality reconstructions of the
unique regions of the genome. As the genome
size, and more importantly the repetitive
content, increases, the WGA approach delivers
less of the repetitive sequence.

The cost and overall efficiency of
clone-by-clone approaches make them difficult to justify
as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific
applications of BAC-based or other clone mapping and
sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently
resolved with computational approaches alone
are clearly worth exploring. Hybrid approaches
to whole-genome sequencing will only work if
there is sufficient coverage in both the
whole-genome shotgun phase and the BAC clone
sequencing phase. Our experience with human
genome assembly suggests that this will require
at least 3× coverage of both whole-genome and
BAC shotgun sequence data.



8.2 The low gene number in humans 

We have sequenced and assembled ~95% of
the euchromatic sequence of H. sapiens and
used a new automated gene prediction method
to produce a preliminary catalog of the
human genes. This has provided a major
surprise: We have found far fewer genes (26,000
to 38,000) than the earlier molecular
predictions (50,000 to over 140,000). Whatever
the reasons for this current disparity, only
detailed annotation, comparative genomics
(particularly using the Mus musculus
genome), and careful molecular dissection of
complex phenotypes will clarify this critical
issue of the basic "parts list" of our genome.
Certainly, the analysis is still incomplete and
considerable refinement will occur in the
years to come as the precise structure of each
transcription unit is evaluated. A good place
to start is to determine why the gene
estimates derived from EST data are so
discordant with our predictions. It is likely that the
following contribute to an inflated gene
number derived from ESTs: the variable lengths
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA
processing that often leave intronic regions in an
unspliced condition; the finding that nearly
40% of human genes are alternatively spliced
(153); and finally, the unsolved technical
problems in EST library construction where
contamination from heterogeneous nuclear
RNA and genomic DNA is not uncommon.
Of course, it is possible that there are genes
that remain unpredicted owing to the absence
of EST or protein data to support them,
although our use of mouse genome data for
predicting genes should limit this number. As
was true at the beginning of genome
sequencing, ultimately it will be necessary to measure
mRNA in specific cell types to demonstrate
the presence of a gene.

J. B. S. Haldane speculated in 1937 that a
population of organisms might have to pay a
price for the number of genes it can possibly
carry. He theorized that when the number of
genes becomes too large, each zygote carries
so many new deleterious mutations that the
population simply cannot maintain itself. On
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced
mutations at specific loci, Muller, in 1967
(154), calculated that the mammalian
genome would contain a maximum of not much
more than 30,000 genes (155). An estimate of
30,000 gene loci for humans was also arrived
at by Crow and Kimura (156). Muller's
estimate for D. melanogaster was 10,000 genes,
compared to 13,000 derived by annotation of
the fly genome (25, 27). These arguments for
the theoretical maximum gene number were
based on simplified ideas of genetic load:
that all genes have a certain low rate of
mutation to a deleterious state. However, it is
clear that many mouse, fly, worm, and yeast
knockout mutations lead to almost no
discernible phenotypic perturbations.

The modest number of human genes
means that we must look elsewhere for the
mechanisms that generate the complexities
inherent in human development and the
sophisticated signaling systems that maintain
homeostasis. There are a large number of
ways in which the functions of individual
genes and gene products are regulated. The
degree of "openness" of chromatin structure
and hence transcriptional activity is regulated
by protein complexes that involve histone
and DNA enzymatic modifications. We
enumerate many of the proteins that are likely
involved in nuclear regulation in Table 19.
The location, timing, and quantity of
transcription are intimately linked to nuclear
signal transduction events as well as to the
tissue-specific expression of many of these
proteins. Equally important are regulatory
DNA elements that include insulators,
repeats, and endogenous viruses (157);
methylation of CpG islands in imprinting (158);
and promoter-enhancer and intronic regions
that modulate transcription. The spliceosomal
machinery consists of multisubunit proteins
(Table 19) as well as structural and catalytic
RNA elements (159) that regulate transcript
structure through alternative start and
termination sites and splicing. Hence, there is a
need to study different classes of RNA
molecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved
in X-dosage compensation, and other
structural RNAs to appreciate their precise role in
regulating gene expression. The phenomenon
of RNA editing in which coding changes
occur directly at the level of mRNA is of
clinical and biological relevance (161).
Finally, examples of translational control include
internal ribosomal entry sites that are found
in proteins involved in cell cycle regulation
and apoptosis (162). At the protein level,
minor alterations in the nature of
protein-protein interactions, protein modifications,
and localization can have dramatic effects on
cellular physiology (163). This dynamic
system therefore has many ways to modulate
activity, which suggests that definition of
complex systems by analysis of single genes
is unlikely to be entirely successful.

In situ studies have shown that the human
genome is asymmetrically populated with
G+C content, CpG islands, and genes (68).
However, the genes are not distributed quite
as unequally as had been predicted (Table 9)
(69). The most G+C-rich fraction of the
genome, H3 isochores, constitute more of the
genome than previously thought (about 9%),
and are the most gene-dense fraction, but
contain only 25% of the genes, rather than the
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the
genes. This inhomogeneity, the net result of
millions of years of mammalian gene
duplication, has been described as the
"desertification" of the vertebrate genome (77). Why
are there clustered regions of high and low
gene density, and are these accidents of
history or driven by selection and evolution? If
these deserts are dispensable, it ought to be
possible to find mammalian genomes that are
far smaller in size than the human genome.
Indeed, many species of bats have genome
sizes that are much smaller than that of
humans; for example, Miniopterus, a species of
Italian bat, has a genome size that is only
50% that of humans (164). Similarly,
Muntiacus, a species of Asian barking deer, has a
genome size that is ~70% that of humans.
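
The isochore percentages quoted in this section can be restated as a relative gene density, that is, the share of genes divided by the share of the genome. The short Python sketch below does only that arithmetic, using the figures given in the text; the function names and the dictionary layout are illustrative assumptions and are not part of the original analysis.

```python
# Relative gene density per isochore class: (share of genes) / (share of genome).
# The percentages are those quoted in the text; all names here are illustrative.

ISOCHORES = {
    # name: (percent of genome, percent of genes)
    "H3": (9.0, 25.0),
    "L": (65.0, 48.0),
}


def relative_gene_density(genome_pct: float, gene_pct: float) -> float:
    """Return gene share divided by genome share (1.0 = genome-average density)."""
    if genome_pct <= 0:
        raise ValueError("genome_pct must be positive")
    return gene_pct / genome_pct


def density_table(isochores=ISOCHORES):
    """Map each isochore class to its relative gene density."""
    return {
        name: relative_gene_density(genome_pct, gene_pct)
        for name, (genome_pct, gene_pct) in isochores.items()
    }


if __name__ == "__main__":
    for name, ratio in density_table().items():
        print(f"{name}: {ratio:.2f}x the genome-average gene density")
```

With these inputs the H3 isochores come out at 25/9, a little under three times the genome-average gene density, and the L isochores at 48/65, somewhat below the average, which is the sense in which the genes are unevenly but not extremely unevenly distributed.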



8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have
identified and mapped more than 3 million SNPs, this
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless,
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of
haplotypes present in subjects of different
ethnogeographic origins, providing insights into
population history and migration patterns. Although
such studies have suggested that modern human 
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and
admixture, SNPs can serve as markers for the extent
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated.

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism — sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the
population as there are autosomal chromo- 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
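
The one-quarter figure for the Y chromosome follows from counting chromosome copies per breeding pair. The Python sketch below makes that count explicit under stated assumptions: equal numbers of breeding females and males, and no other demographic or selective effects. The three-quarters value it yields for the X is a consequence of the same idealized count, not a number taken from the text, and the names used are illustrative.

```python
from fractions import Fraction

# Chromosome copies carried by one female (XX) and one male (XY).
# Idealized count: ignores selection, drift history, and unequal sex ratios
# unless the caller changes the numbers of breeding females and males.
COPIES = {
    "autosome": {"female": 2, "male": 2},
    "X": {"female": 2, "male": 1},
    "Y": {"female": 0, "male": 1},
}


def total_copies(chrom: str, females: int = 1, males: int = 1) -> int:
    """Number of copies of `chrom` in a population of the given composition."""
    per_sex = COPIES[chrom]
    return per_sex["female"] * females + per_sex["male"] * males


def relative_copy_number(chrom: str, females: int = 1, males: int = 1) -> Fraction:
    """Copies of `chrom` relative to copies of any one autosome."""
    return Fraction(
        total_copies(chrom, females, males),
        total_copies("autosome", females, males),
    )


if __name__ == "__main__":
    for chrom in ("autosome", "X", "Y"):
        print(chrom, relative_copy_number(chrom))
```

If neutral diversity scales with the number of chromosome copies and mutation rates are equal, these ratios give a first expectation for relative polymorphism levels; departures from them are then candidates for the other forces discussed above.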




8.4 Genome complexity 

We will soon be in a position to move away
from the cataloging of individual components
of the system, and beyond the simplistic
notions of "this binds to that, which
then docks on this, and then the complex
moves there" (167) to the exciting area
of network perturbations, nonlinear
responses and thresholds, and their pivotal
role in human diseases.

The enumeration of other "parts lists"
reveals that in organisms with complex nervous
systems, neither gene number, neuron number,
nor number of cell types correlates in any
meaningful manner with even simplistic
measures of structural or behavioral complexity.
Nor would they be expected to; this is the realm
of nonlinearities and epigenesis (168). The 520
million neurons of the common octopus exceed
the neuronal number in the brain of a mouse by
an order of magnitude. It is apparent from a
comparison of genomic data on the mouse and
human, and from comparative mammalian
neuroanatomy (169), that the morphological and
behavioral diversity found in mammals is
underpinned by a similar gene repertoire and
similar neuroanatomies. For example, when one
compares a pygmy marmoset (which is only 4
inches tall and weighs about 6 ounces) to a
chimpanzee, the brain volume of this minute
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp
and three orders less than that of humans. Yet
the neuroanatomies of all three brains are
strikingly similar, and the behavioral characteristics
of the pygmy marmoset are little different from
those of chimpanzees. Between humans and
chimpanzees, the gene number, gene structures
and functions, chromosomal and genomic
organizations, and cell types and neuroanatomies
are almost indistinguishable, yet the
developmental modifications that predisposed human
lineages to cortical expansion and development
of the larynx, giving rise to language,
culminated in a massive singularity that by even the
simplest of criteria made humans more
complex in a behavioral sense.

Simple examination of the number of
neurons, cell types, or genes or of the genome
size does not alone account for the differences
in complexity that we observe. Rather, it is
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt,
frizzled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive
conclusion that Einstein's brain was more
complex than that of Drosophila, closer
comparisons such as whether the set of predicted
human proteins is more complex than the
protein set of Drosophila, and if so, to what
degree, are not straightforward, since protein,
protein domain, or protein-protein interaction
measures do not capture context-dependent
interactions that underpin the dynamics
underlying phenotype.

Currently, there are more than 30 different
mathematical descriptions of complexity (170).
However, we have yet to understand the
mathematical dependency relating the number of
genes with organism complexity. One
pragmatic approach to the analysis of biological
systems, which are composed of nonidentical
elements (proteins, protein complexes, interacting
cell types, and interacting neuronal
populations), is through graph theory (171). The
elements of the system can be represented by the
vertices of complex topographies, with the
edges representing the interactions between them.
Examination of large networks reveals that they
can self-organize, but more important, they can
be particularly robust. This robustness is not
due to redundancy, but is a property of
inhomogeneously wired networks. The error
tolerance of such networks comes with a price; they
are vulnerable to the selection or removal of a
few nodes that contribute disproportionately to
network stability. Gene knockouts provide an
illustration. Some knockouts may have minor
effects, whereas others have catastrophic effects
on the system. In the case of vimentin, a
supposedly critical component of the cytoplasmic
intermediate filament network of mammals, the
knockout of the gene in mice reveals them to be
reproductively normal, with no obvious
phenotypic effects (172), and yet the usually
conspicuous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical
nodes whose reduction in gene product, or total
elimination, causes the network to crash most
of the time, although even in some of these
cases, phenotypic normalcy ensues, given the
appropriate genetic background. Thus, there are
no "good" genes or "bad" genes, but only
networks that exist at various levels and at
different connectivities, and at different states of
sensitivity to perturbation. Sophisticated
mathematical analysis needs to be constantly
evaluated against hard biological data sets that
specifically address network dynamics. Nowhere is
this more critical than in attempts to come to
grips with "complexity," particularly because
deconvoluting and correcting complex
networks that have undergone perturbation, and
have resulted in human diseases, is the greatest
significant challenge now facing us.
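
The contrast drawn above, tolerance of most node removals but vulnerability to the loss of a few highly connected nodes, can be illustrated with a small graph experiment. The Python sketch below is a toy illustration under stated assumptions, not the analysis cited in the text: the preferential-attachment growth rule is a stand-in chosen here to produce an inhomogeneously wired network, and all function names and parameters are hypothetical.

```python
import random
from collections import deque


def preferential_attachment_graph(n, m=2, seed=0):
    """Grow an undirected graph in which each new node links to m existing
    nodes chosen with probability proportional to their current degree.
    (Assumed growth rule, used only to obtain an uneven degree distribution.)"""
    if n < m + 1:
        raise ValueError("n must be at least m + 1")
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Start from a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
    # One list entry per edge endpoint, so sampling is degree-weighted.
    endpoints = [v for v in range(m + 1) for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints.extend((new, t))
    return adj


def largest_component(adj, removed=()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


def hubs(adj, k):
    """The k highest-degree nodes (ties broken by node id)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


if __name__ == "__main__":
    g = preferential_attachment_graph(n=500, m=2, seed=1)
    k = 25
    random_nodes = random.Random(2).sample(sorted(g), k)
    print("intact network:      ", largest_component(g))
    print("after random removal:", largest_component(g, random_nodes))
    print("after hub removal:   ", largest_component(g, hubs(g, k)))
```

Comparing the last two printed sizes for a given seed shows how much more the connected core shrinks when the removed nodes are the most connected ones rather than a random sample; in the knockout analogy, the random case corresponds to the many mutations with minor effects and the hub case to the critical nodes.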

It has been predicted for the last 15 years
that complete sequencing of the human
genome would open up new strategies for
human biological research and would have a
major impact on medicine, and through
medicine and public health, on society. Effects on
biomedical research are already being felt.
This assembly of the human genome
sequence is but a first, hesitant step on a long
and exciting journey toward understanding
the role of the genome in human biology. It
has been possible only because of
innovations in instrumentation and software that
have allowed automation of almost every step
of the process from DNA preparation to
annotation. The next steps are clear: We must
define the complexity that ensues when this
relatively modest set of about 30,000 genes is
expressed. The sequence provides the
framework upon which all the genetics,
biochemistry, physiology, and ultimately phenotype
depend. It provides the boundaries for
scientific inquiry. The sequence is only the first
level of understanding of the genome. All
genes and their control elements must be
identified; their functions, in concert as well
as in isolation, defined; their sequence
variation worldwide described; and the relation
between genome variation and specific
phenotypic characteristics determined. Now we
know what we have to explain.

Another paramount challenge awaits:
public discussion of this information and its
potential for improvement of personal health.
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
falls in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc- 
tionism, the view that with complete knowl- 
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
pair ij in the context of the graph as a whole by simply dividing the number
of BLAST hits shared in common between the two proteins by the total number of
proteins hit by i and j. This simple metric has several interesting
properties. First, because the similarity metric takes into account both the
similarity and the differences between the two sequences at the level of BLAST
hits, the metric respects the multidomain nature of protein space. Two
multidomain proteins, for instance, each containing domains A and B, will have
a greater pairwise similarity to each other than either one will have to a
protein containing only A or B domains, so long as A-B-containing multidomain
proteins are less frequent in the proteome than are single-domain proteins
containing A or B domains. A second interesting property of this similarity
metric is that it can be used to produce a similarity matrix for the proteome
as a whole without having to first produce a multiple alignment for each
protein family, an error-prone and very time-consuming process. Finally, the
metric does not require that either sequence have significant homology to the
other in order to have a defined similarity to each other, only that they
share at least one significant BLAST hit in common. This is an especially
interesting property of the metric because it allows the rapid recovery of
protein families from the proteome for which no multiple alignment is
possible, thus providing a computational basis for the extension of protein
homology searches beyond those of current HMM- and profile-based search
methods. Once the whole-proteome similarity matrix has been calculated, Lek
first partitions the proteome into single-linkage clusters (27) on the basis
of one or more shared BLAST hits between two sequences. Next, these
single-linkage clusters are further partitioned into subclusters, each member
of which shares a user-specified pairwise similarity with the other members of
the cluster, as described above. For the purposes of this publication, we have
focused on the analysis of single-linkage clusters and what we have termed
"complete clusters," e.g., those subclusters for which every member has a
similarity metric of 1 to every other member of the subcluster. We believe
that the single-linkage and complete clusters are of special interest, in
part, because they allow us to estimate and to compare sizes of core protein
sets in a rigorous manner. The rationale for this is as follows: if one
imagines for a moment a perfect clustering algorithm capable of perfectly
partitioning one or more perfectly annotated protein sets into protein
families, it is reasonable to assume that the number of clusters will always
be greater than, or equal to, the number of single-linkage clusters, because
single-linkage clustering is a maximally agglomerative clustering method.
Thus, if there exists a single protein in the predicted protein set containing
domains A and B, then it will be clustered by single linkage together with all
single-domain proteins containing domains A or B. Likewise, for a predicted
protein set containing a single multidomain protein, the number of real
clusters must always be less than or equal to the number of complete clusters,
because it is impossible to place a unique multidomain protein into a complete
cluster. Thus, the single-linkage and complete clusters plus singletons should
comprise a lower and upper bound of sizes of core protein sets, respectively,
allowing us to compare the relative size and complexity of different
organisms' predicted protein set
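
The similarity metric and the two clusterings described in the passage above can be illustrated with a minimal sketch. It is not the Lek implementation: the function names, the toy hit sets, and the reading of "the total number of proteins hit by i and j" as the union of the two hit sets are all assumptions made for illustration. Under that reading, a similarity of 1 means the two hit sets are identical, which is how the sketch forms "complete clusters".

```python
from itertools import combinations


def hit_similarity(hits_i, hits_j):
    """Shared BLAST hits divided by all proteins hit by either protein.

    Assumption: "total number of proteins hit by i and j" is the set union.
    """
    union = hits_i | hits_j
    if not union:
        return 0.0
    return len(hits_i & hits_j) / len(union)


def single_linkage_clusters(hits):
    """Group proteins that are connected by at least one shared BLAST hit."""
    parent = {p: p for p in hits}

    def find(p):
        # Follow parent links to the cluster representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(hits, 2):
        if hits[a] & hits[b]:
            parent[find(a)] = find(b)

    clusters = {}
    for p in hits:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


def complete_clusters(hits):
    """Group proteins whose pairwise similarity is 1 (identical hit sets).

    Singletons are returned as one-member groups. Assumes every protein has
    at least one hit (for example, itself).
    """
    groups = {}
    for protein, hit_set in hits.items():
        groups.setdefault(frozenset(hit_set), []).append(protein)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy proteome: AB1 and AB2 carry domains A and B, protein A
# carries only domain A, protein B only domain B, and C is unrelated.
TOY_HITS = {
    "AB1": {"AB1", "AB2", "A", "B"},
    "AB2": {"AB1", "AB2", "A", "B"},
    "A": {"AB1", "AB2", "A"},
    "B": {"AB1", "AB2", "B"},
    "C": {"C"},
}

if __name__ == "__main__":
    print(hit_similarity(TOY_HITS["AB1"], TOY_HITS["AB2"]))
    print(hit_similarity(TOY_HITS["AB1"], TOY_HITS["A"]))
    print(single_linkage_clusters(TOY_HITS))
    print(complete_clusters(TOY_HITS))
```

In this toy set the two multidomain proteins are more similar to each other than to either single-domain protein, and proteins A and B receive a defined similarity even though they hit each other only indirectly. Single linkage merges everything except C into one cluster, while the complete clusters plus singletons give four groups, matching the lower-bound and upper-bound roles described in the text.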
arrangements. Thus, the probability of chance occurrence is $1/N^{k-1}$. Allowing for both sets of genes (e.g.,
increases this to $L^{2}/N^{k-1}$. The duplicated segment
translocation; allowing for M such rearrangements
Humanity has been given a great gift. With the completion of the human genome
sequence, we have received a powerful tool for unlocking the secrets of our
genetic heritage and for finding our place among the other participants in the
adventure of life.

This week's issue of Science contains the report of the sequencing of the
human genome from a group of authors led by Craig Venter of Celera Genomics.
The report of the sequencing of the human genome from the publicly funded
consortium of laboratories led by Francis Collins appears in this week's
Nature. This stunning achievement has been portrayed, often unfairly, as a
competition between two ventures, one public and one private. That
characterization detracts from the awesome accomplishment jointly unveiled
this week. In truth, each project contributed to the other. The inspired
vision that launched the publicly funded project roughly 10 years ago
reflected, and now rewards, the confidence of those who believe that the
pursuit of large-scale fundamental problems in the life sciences is in the
national interest. The technical innovation and drive of Craig Venter and his
colleagues made it possible to celebrate this accomplishment far sooner than
was believed possible. Thus we can salute what has become, in the end, not a
contest but a marriage (perhaps encouraged by shotgun) between public funding
and private entrepreneurship.

There are excellent scientific reasons for applauding an outcome that has
given us two winners. Two sequences are better than one; the opportunity for
comparison and convergence is invaluable. Indeed, a real-world proof of the
importance of access to both sets of data can be found in the pages of this
issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the
sequencing of the human genome represents, not an ending, but the beginning of
a new approach to biology. As Galas says in his Viewpoint (p. 1257), the
knowledge that all of the genetic components of any process can be identified
will give extraordinary new power to scientists. Because of this breakthrough,
research can evolve from analyzing the effects of individual genes to a more
integrated view that examines whole ensembles of genes as they interact to
form a living human being. Several articles in this issue highlight how this
approach is already beginning to revolutionize the way we look at human
disease.

This has been a massive project, on a scale unparalleled in the history of
biology, but of course it has built on the scientific insights of centuries of
investigators. By coincidence, this landmark announcement falls during the
week of the anniversary of the birth of Charles Darwin. Darwin's message that
the survival of a species can depend on its ability to evolve in the face of
change is peculiarly pertinent to discussions that have gone on in the past
year over access to the Celera data. Full information regarding the agreements
that were reached to make [illegible] found at
www.sciencemag.org/feature/data/announcement/gsp.shl. We are willing to be
flexible in allowing data repositories other than the traditional GenBank,
while insisting on access to all the data needed to verify conclusions. In
this domain, change is [illegible] are producing more and more potentially
valuable sequences, yet (at least in the United States) laws governing
databases provide scant protection against piracy. Had the Celera data been
kept secret, it would have been a serious loss to the scientific community. We
hope that our adaptability in the face of change will enable other proprietary
data to be published after peer review, in a way that satisfies our continuing
commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully
watched, has created new challenges for the scientific venture. Science is
proud to have played a role in bringing this discovery onto the public stage.
It is literally true that this is a historic moment for the scientific
endeavor. The human genome has been called the Book of Life. Rather, it is a
library, in which, with rules that encourage exploration and reward
creativity, we can find many of the books that will help define us and our
place in the great tapestry of life.

Barbara R. Jasny and Donald Kennedy
Sequences producing significant alignments:

AC097715.3.1.143642
AC019105.7.1.170491
AC019159.8.1.163085
AC104648.2.1.112084
AC074362.5.1.149705
>AC097715.3.1.143642

Length = 143642

Score = 526 bits (265), Expect = e-146
Identities = 266/267 (99%)
Strand = Plus / Plus

Sbjct: 21277 gggcaatgtcactttttcctgctccgaaccacagattgtgcccatcacatttgtcaactc 21336

Query: 1122  cagcggcagctatttgctgctgcccggcaccccccaaattgatgggctctcagtgagttt 1181
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 21337 cagcggcagctatttgctgctgcccggcaccccccaaattgatgggctctcagtgagttt 21396

Query: 1182  ccagtttcgaacatggaacaaggatggtctgcttctgtccacagagctgtctgagggctc 1241
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 21397 ccagtttcgaacatggaacaaggatggtctgcttctgtccacagagctgtctgagggctc 21456

Query: 1242  gggaaccctgctgctgagcctggagggtggaatcctgagactcgtgattcagaaaatgac 1301
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 21457 gggaaccctgctgctgagcctggagggtggaatcctgagactcgtgattcagaaaatgac 21516

Query: 1302  agaacgcgtagctgaaatcctcacagg 1328
             |||||||||||||||||||||||||||
Sbjct: 21517 agaacgcgtagctgaaatcctcacagg 21543


Score = 349 bits (176), Expect = 7e-93
Identities = 176/176 (100%)
Strand = Plus / Plus

Query: 1476  agggtgccccgacaatctcaccgattcccaatgtttaaatcccattaaggctttccaagg 1535
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44269 agggtgccccgacaatctcaccgattcccaatgtttaaatcccattaaggctttccaagg 44328

Query: 1536  ctgcatgaggctcatctttattgataaccagcccaaggacctcatttcagttcagcaagg 1595
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44329 ctgcatgaggctcatctttattgataaccagcccaaggacctcatttcagttcagcaagg 44388

Query: 1596  ttccctggggaattttagtgatttacacattgatctgtgtagcatcaaagacaggt 1651
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44389 ttccctggggaattttagtgatttacacattgatctgtgtagcatcaaagacaggt 44444


Score = 305 bits (154), Expect = 1e-79
Identities = 154/154 (100%)
Strand = Plus / Plus

Query: 1325  caggcagcaacttgaatgatggcctgtggcactcggttagcatcaacgccaggaggaacc 1384
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 41286 caggcagcaacttgaatgatggcctgtggcactcggttagcatcaacgccaggaggaacc 41345

Query: 1385  gcatcacgctcactctggatgatgaagcagcacccccggctccagacagcacttgggtgc 1444
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 41346 gcatcacgctcactctggatgatgaagcagcacccccggctccagacagcacttgggtgc 41405

Query: 1445  agatttattctggaaatagctactattttggagg 1478
             ||||||||||||||||||||||||||||||||||
Sbjct: 41406 agatttattctggaaatagctactattttggagg 41439


Score = 238 bits (120), Expect = 2e-59
Identities = 120/120 (100%)
Strand = Plus / Plus

Query: 1757   ccatctacgagcaatcctgcgaggtgtacaggcaccaggggaatacagccggcttcttct 1816
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 126787 ccatctacgagcaatcctgcgaggtgtacaggcaccaggggaatacagccggcttcttct 126846

Query: 1817   acatcgactcagatggcagcggcccactgggacctctccaggtgtactgcaatatcactg 1876
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 126847 acatcgactcagatggcagcggcccactgggacctctccaggtgtactgcaatatcactg 126906


Score = 216 bits (109), Expect = 7e-53
Identities = 109/109 (100%)
Strand = Plus / Plus

Query: 1648  aggtgtttgccaaactactgtgaacatggaggaagctgctcccagtcctggactaccttc 1707
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 80201 aggtgtttgccaaactactgtgaacatggaggaagctgctcccagtcctggactaccttc 80260

Query: 1708  tattgtaactgcagtgacacaagttacactggtgccacctgccacaact 1756
             |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 80261 tattgtaactgcagtgacacaagttacactggtgccacctgccacaact 80309



>AC019105.7.1.170491

Length = 170491

Score = 482 bits (243), Expect = e-133
Identities = 243/243 (100%)
Strand = Plus / Plus

Query: 2750 tagggggaacgtcatccagacagaaaggcttcctaggatgcattcgctccttacacttga 2809
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 152  tagggggaacgtcatccagacagaaaggcttcctaggatgcattcgctccttacacttga 211

Query: 2810 atggacagaaaatggacctggaagagagggcaaaggtcacatctggagtcaggccaggct 2869
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 212  atggacagaaaatggacctggaagagagggcaaaggtcacatctggagtcaggccaggct 271

Query: 2870 gccccggccactgcagcagctacggcagcatctgccacaacgggggcaagtgtgtggaga 2929
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 272  gccccggccactgcagcagctacggcagcatctgccacaacgggggcaagtgtgtggaga 331

Query: 2930 agcacaatggctacctgtgtgattgcaccaattcaccttatgaagggcccttttgcaaaa 2989
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 332  agcacaatggctacctgtgtgattgcaccaattcaccttatgaagggcccttttgcaaaa 391

Query: 2990 aag 2992
            |||
Sbjct: 392  aag 394


Score = 452 bits (228), Expect = e-124
Identities = 228/228 (100%)
Strand = Plus / Plus

Query: 2991 agaggtttctgctgtttttgaggctggcacgtcggttacttacatgtttcaagaacccta 3050
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 8347 agaggtttctgctgtttttgaggctggcacgtcggttacttacatgtttcaagaacccta 8406

Query: 3051 tcctgtgaccaagaatataagcctctcatcctcagctatttacacagattcagctccatc 3110
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 8407 tcctgtgaccaagaatataagcctctcatcctcagctatttacacagattcagctccatc 8466

Query: 3111 caaggaaaacattgcacttagctttgtgacaacccaggcacccagtcttttgctctttat 3170
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 8467 caaggaaaacattgcacttagctttgtgacaacccaggcacccagtcttttgctctttat 8526

Query: 3171 caattcttcttctcaggacttcgtggttgttctgctctgcaagaatgg 3218
            ||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 8527 caattcttcttctcaggacttcgtggttgttctgctctgcaagaatgg 8574


Score = 440 bits (222), Expect = e-120
Identities = 222/222 (100%)
Strand = Plus / Plus

Query: 3434   cagagaatcttggtttggattctgaagttgctaaagcaaatgccatgggttttgctggat 3493
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 113132 cagagaatcttggtttggattctgaagttgctaaagcaaatgccatgggttttgctggat 113191

Query: 3494   gcatgtcttccgtccagtacaaccacatagcaccactgaaggctgccctgcgccatgcca 3553
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 113192 gcatgtcttccgtccagtacaaccacatagcaccactgaaggctgccctgcgccatgcca 113251

Query: 3554   ctgtcgcgcctgtgactgtccatgggaccttgacggaatccagctgtggcttcatggtgg 3613
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 113252 ctgtcgcgcctgtgactgtccatgggaccttgacggaatccagctgtggcttcatggtgg 113311

Query: 3614   actcagatgtgaatgcagtgaccacggtgcattcttcatcag 3655
              ||||||||||||||||||||||||||||||||||||||||||
Sbjct: 113312 actcagatgtgaatgcagtgaccacggtgcattcttcatcag 113353


Score = 395 bits (199), Expect = e-106
Identities = 199/199 (100%)
Strand = Plus / Plus

Query: 3726   aggggtgatagcagtggtgatattcatcatcttctgtatcatcggcatcatgacccggtt 3785
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 124343 aggggtgatagcagtggtgatattcatcatcttctgtatcatcggcatcatgacccggtt 124402

Query: 3786   cctctaccagcacaagcagtcacatcgtacgagccagatgaaggagaaggaatatccaga 3845
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 124403 cctctaccagcacaagcagtcacatcgtacgagccagatgaaggagaaggaatatccaga 124462

Query: 3846   aaatttggacagttccttcagaaatgaaattgacttgcaaaacacagtgagcgagtgtaa 3905
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 124463 aaatttggacagttccttcagaaatgaaattgacttgcaaaacacagtgagcgagtgtaa 124522

Query: 3906   acgggaatatttcatctga 3924
              |||||||||||||||||||
Sbjct: 124523 acgggaatatttcatctga 124541


Score = 262 bits (132), Expect = 1e-66
Identities = 132/132 (100%)
Strand = Plus / Plus

Query: 3217  ggaagcttacaggttcgctatcacctaaacaaggaagaaacccatgtattcaccattgat 3276
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 75558 ggaagcttacaggttcgctatcacctaaacaaggaagaaacccatgtattcaccattgat 75617

Query: 3277  gcagataactttgctaacagaaggatgcaccacttgaagattaaccgagagggaagagag 3336
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 75618 gcagataactttgctaacagaaggatgcaccacttgaagattaaccgagagggaagagag 75677

Query: 3337  cttaccattcag 3348
             ||||||||||||
Sbjct: 75678 cttaccattcag 75689


Score = 180 bits (91), Expect = 4e-42
Identities = 91/91 (100%)
Strand = Plus / Plus

Query: 3346  cagatggaccagcaacttcgactcagttataacttctctccggaagtagagttcagggtt 3405
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 79925 cagatggaccagcaacttcgactcagttataacttctctccggaagtagagttcagggtt 79984

Query: 3406  ataaggtcactcaccttgggcaaagtcacag 3436
             |||||||||||||||||||||||||||||||
Sbjct: 79985 ataaggtcactcaccttgggcaaagtcacag 80015


Score = 153 bits (77), Expect = 1e-33
Identities = 77/77 (100%)
Strand = Plus / Plus

Query: 3652   tcagatccttttgggaagacagatgagcgggaaccactcacaaatgctgttcgaagtgat 3711
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 121716 tcagatccttttgggaagacagatgagcgggaaccactcacaaatgctgttcgaagtgat 121775

Query: 3712   tcggcagtcatcggagg 3728
              |||||||||||||||||
Sbjct: 121776 tcggcagtcatcggagg 121792



>AC019159.8.1.163085

Length = 163085

Score = 436 bits (220), Expect = e-119
Identities = 220/220 (100%)
Strand = Plus / Plus

Query: 2534   ctccttcagagatcacctttgccatcgatgttgggaatggtcctgtggagcttgtagtcc 2593
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148143 ctccttcagagatcacctttgccatcgatgttgggaatggtcctgtggagcttgtagtcc 148202

Query: 2594   agtctccttctcttctgaatgacaaccaatggcactatgtccgggctgagaggaacctca 2653
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148203 agtctccttctcttctgaatgacaaccaatggcactatgtccgggctgagaggaacctca 148262

Query: 2654   aggagacctccctgcaggtggacaaccttccaaggagcaccagggagacgtcggaggagg 2713
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 148263 aggagacctccctgcaggtggacaaccttccaaggagcaccagggagacgtcggaggagg 148322

Query: 2714   gccattttcgactgcagctgaacagccagttgtttgtagg 2753
              ||||||||||||||||||||||||||||||||||||||||
Sbjct: 148323 gccattttcgactgcagctgaacagccagttgtttgtagg 148362


Score = 401 bits (202), Expect = e-108
Identities = 202/202 (100%)
Strand = Plus / Plus

Query: 1876  gaggacaagatctggacatcagtgcagcacaacaatacagagctgacccgagtgcggggc 1935
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23101 gaggacaagatctggacatcagtgcagcacaacaatacagagctgacccgagtgcggggc 23160

Query: 1936  gctaaccctgagaagccctatgccatggccttggactacgggggcagcatggaacagctg 1995
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23161 gctaaccctgagaagccctatgccatggccttggactacgggggcagcatggaacagctg 23220

Query: 1996  gaggccgtgatcgacggctctgagcactgtgagcaggaggtggcctaccactgcaggagg 2055
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23221 gaggccgtgatcgacggctctgagcactgtgagcaggaggtggcctaccactgcaggagg 23280

Query: 2056  tcccgcctgctcaacacgccgg 2077
             ||||||||||||||||||||||
Sbjct: 23281 tcccgcctgctcaacacgccgg 23302


Score = 339 bits (171), Expect = 7e-90
Identities = 171/171 (100%)
Strand = Plus / Plus

Query: 2363   gacgcttctggaacgccgtctcattttatacagaagcctcttacctccactttcctacct 2422
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 139321 gacgcttctggaacgccgtctcattttatacagaagcctcttacctccactttcctacct 139380

Query: 2423   tccatgcggaattcagtgccgatatttccttcttttttaaaaccacagcattatccggag 2482
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 139381 tccatgcggaattcagtgccgatatttccttcttttttaaaaccacagcattatccggag 139440

Query: 2483   ttttcctagaaaatcttggcattaaagacttcattcgactcgaaataagct 2533
              |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 139441 ttttcctagaaaatcttggcattaaagacttcattcgactcgaaataagct 139491


Score = 315 bits (159), Expect = 1e-82
Identities = 159/159 (100%)
Strand = Plus / Plus

Query: 2077   gatggaacaccatttacctggtggattgggcggtccaatgaaaggcacccttactgggga 2136
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 122572 gatggaacaccatttacctggtggattgggcggtccaatgaaaggcacccttactgggga 122631

Query: 2137   ggttcccctcctggggtccagcagtgtgagtgtggcctagacgagagctgcctggacatt 2196
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 122632 ggttcccctcctggggtccagcagtgtgagtgtggcctagacgagagctgcctggacatt 122691

Query: 2197   cagcacttttgcaattgcgacgctgacaaggatgaatgg 2235
              |||||||||||||||||||||||||||||||||||||||
Sbjct: 122692 cagcacttttgcaattgcgacgctgacaaggatgaatgg 122730


Score = 258 bits (130), Expect = 2e-65
Identities = 130/130 (100%)
Strand = Plus / Plus

Query: 2234   ggacaaatgatactggctttctttccttcaaagaccacttgcctgtcactcagatagtta 2293
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 139015 ggacaaatgatactggctttctttccttcaaagaccacttgcctgtcactcagatagtta 139074

Query: 2294   tcactgataccgacagatcaaactcagaagccgcttggagaattggtcccttgcgttgct 2353
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 139075 tcactgataccgacagatcaaactcagaagccgcttggagaattggtcccttgcgttgct 139134

Query: 2354   atggtgaccg 2363
              ||||||||||
Sbjct: 139135 atggtgaccg 139144



>AC104648.2.1.112084

Length = 112084

Score = 401 bits (202), Expect = e-108
Identities = 205/206 (99%)
Strand = Plus / Plus

Query: 530   aatcagacgttgctgactttgatggccgaagctcacttctgtacaggttcaatcagaagt 589
             ||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61554 aatcagatgttgctgactttgatggccgaagctcacttctgtacaggttcaatcagaagt 61613

Query: 590   tgatgagtactctcaaagatgtgatctccctgaagttcaagagcatgcaaggagatgggg 649
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61614 tgatgagtactctcaaagatgtgatctccctgaagttcaagagcatgcaaggagatgggg 61673

Query: 650   tcctgttccatggagaaggtcagcgtggagaccacatcaccttggaactccagaagggga 709
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61674 tcctgttccatggagaaggtcagcgtggagaccacatcaccttggaactccagaagggga 61733

Query: 710   ggctcgccctacacctcaatttgggt 735
             ||||||||||||||||||||||||||
Sbjct: 61734 ggctcgccctacacctcaatttgggt 61759


Score = 365 bits (184), Expect = 1e-97
Identities = 185/186 (99%)
Strand = Plus / Plus

Query: 733   ggtgacagcaaagcgcggctcagcagcagcttgccctctgccaccctgggcagcctcctg 792
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 73822 ggtgacagcaaagcgcggctcagcagcagcttgccctctgccaccctgggcagcctcctg 73881

Query: 793   gatgaccagcactggcactyggtcctcattgagcgggtgggcaagcaggtgaacttcacg 852
             ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||
Sbjct: 73882 gatgaccagcactggcactcggtcctcattgagcgggtgggcaagcaggtgaacttcacg 73941

Query: 853   gtggacaagcacacacagcacttccgcaccaagggcgagacggatgccttagacattgac 912
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 73942 gtggacaagcacacacagcacttccgcaccaagggcgagacggatgccttagacattgac 74001

Query: 913   tatgag 918
             ||||||
Sbjct: 74002 tatgag 74007


Score = 296 bits (149), Expect = 1e-76
Identities = 149/149 (100%)
Strand = Plus / Plus

Query: 381   gacctttgcaggaaacatgaatgctgacagcgtggtgcaccacaagctattgcactcagt 440
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44512 gacctttgcaggaaacatgaatgctgacagcgtggtgcaccacaagctattgcactcagt 44571

Query: 441   gagagcccgatttgttcgctttgtgcccctggaatggaatcccagtgggaagattggcat 500
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44572 gagagcccgatttgttcgctttgtgcccctggaatggaatcccagtgggaagattggcat 44631

Query: 501   gagagtcgaggtctacggatgttcctata 529
             |||||||||||||||||||||||||||||
Sbjct: 44632 gagagtcgaggtctacggatgttcctata 44660


Score = 288 bits (145), Expect = 2e-74
Identities = 146/147 (99%)
Strand = Plus / Plus

Query: 917    agcttagttttggaggaattccagtaccaggaaaacctgggacctttttaaagaaaaact 976
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 101807 agcttagttttggaggaattccagtaccaggaaaacctgggacctttttaaagaaaaact 101866

Query: 977    tccatggatgcatcgaaaacctttactacaatggagtaaacataattracctggctaaga 1036
              ||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||
Sbjct: 101867 tccatggatgcatcgaaaacctttactacaatggagtaaacataattgacctggctaaga 101926

Query: 1037   gacgaaagcatcagatctatactgtgg 1063
              |||||||||||||||||||||||||||
Sbjct: 101927 gacgaaagcatcagatctatactgtgg 101953



>AC074362.5.1.149705

Length = 149705

Score = 387 bits (195), Expect = e-104
Identities = 195/195 (100%)
Strand = Plus / Plus

Query: 187   ggaactggcggttggtccccagcagattccaatgctcaacagtggctccagatggacctg 246
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 93165 ggaactggcggttggtccccagcagattccaatgctcaacagtggctccagatggacctg 93224

Query: 247   ggaaacagagtagagattacagcagtggccacgcagggaagatacggaagctctgactgg 306
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 93225 ggaaacagagtagagattacagcagtggccacgcagggaagatacggaagctctgactgg 93284

Query: 307   gtgacgagttacagcctgatgttcagtgacacaggacgcaactggaaacagtacaaacaa 366
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 93285 gtgacgagttacagcctgatgttcagtgacacaggacgcaactggaaacagtacaaacaa 93344

Query: 367   gaagacagcatctgg 381
             |||||||||||||||
Sbjct: 93345 gaagacagcatctgg 93359


Score = 210 bits (106), Expect = 5e-51
Identities = 106/106 (100%)
Strand = Plus / Plus

Query: 83    acaactgtgatgatccactagcatccctgctctctccaatggctttttccagttcctcag 142
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 72671 acaactgtgatgatccactagcatccctgctctctccaatggctttttccagttcctcag 72730

Query: 143   acctcactggcactcacagcccagctcaactcaactggagagttgg 188
             ||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 72731 acctcactggcactcacagcccagctcaactcaactggagagttgg 72776



>AC079154.5.1.175387

Length = 175387

Score = 163 bits (82), Expect = 1e-36
Identities = 82/82 (100%)
Strand = Plus / Plus

Query: 1      atggattctttaccacggctgaccagcgttttgactttgctgttctctggcttgtggcat 60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 124330 atggattctttaccacggctgaccagcgttttgactttgctgttctctggcttgtggcat 124389

Query: 61     ttaggattaacagcgacaaact 82
              ||||||||||||||||||||||
Sbjct: 124390 ttaggattaacagcgacaaact 124411
1: AC097715. Homo sapiens BAC ...[gi:18056738]

LOCUS       AC097715 143642 bp DNA linear PRI 21-FEB-2002
DEFINITION  Homo sapiens BAC clone RP11-563A13 from 2, complete sequence.
ACCESSION   AC097715 AC027111
VERSION     AC097715.3 GI:18056738
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 143642)
  AUTHORS   Sulston,J.E. and Waterston,R.
  TITLE     Toward a complete human genome sequence
  JOURNAL   Genome Res. 8 (11), 1097-1108 (1998)
  MEDLINE   99063792
  PUBMED    9847074
REFERENCE   2 (bases 1 to 143642)
  AUTHORS   Tomlinson,C. and Kozlowicz,A.
  TITLE     The sequence of Homo sapiens BAC clone RP11-563A13
  JOURNAL   Unpublished (2001)
REFERENCE   3 (bases 1 to 143642)
  AUTHORS   Waterston,R.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-OCT-2001) Genome Sequencing Center, Washington
            University School of Medicine, 4444 Forest Park Parkway, St. Louis,
            MO 63108, USA
REFERENCE   4 (bases 1 to 143642)
  AUTHORS   Waterston,R.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-JAN-2002) Genome Sequencing Center, Washington
            University School of Medicine, 4444 Forest Park Parkway, St. Louis,
            MO 63108, USA
REFERENCE   5 (bases 1 to 143642)
  AUTHORS   Waterston,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-FEB-2002) Department of Genetics, Washington
            University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA
COMMENT     On Jan 4, 2002 this sequence version replaced gi:17647085.
            Genome Center
            Center: Washington University Genome Sequencing Center
            Center code: WUGSC
            Web site: http://genome.wustl.edu/gsc
            Contact: sapiens@watson.wustl.edu
            Summary Statistics
            Center project name: H_NH0563A13
            Drafting Center: WIBR
1: AC019105. Homo sapiens BAC ...[gi:13677157]

LOCUS       AC019105 170491 bp DNA linear PRI 07-NOV-2001
DEFINITION  Homo sapiens BAC clone RP11-475A8 from 2, complete sequence.
ACCESSION   AC019105
VERSION     AC019105.7 GI:13677157
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 170491)
  AUTHORS   Sulston,J.E. and Waterston,R.
  TITLE     Toward a complete human genome sequence
  JOURNAL   Genome Res. 8 (11), 1097-1108 (1998)
  MEDLINE   99063792
  PUBMED    9847074
REFERENCE   2 (bases 1 to 170491)
  AUTHORS   Belter,E., Doebber,A., Abbott,A. and Ahluwalia,R.
  TITLE     The sequence of Homo sapiens BAC clone RP11-475A8
  JOURNAL   Unpublished
REFERENCE   3 (bases 1 to 170491)
  AUTHORS   Waterston,R.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-DEC-1999) Genome Sequencing Center, Washington
            University School of Medicine, 4444 Forest Park Parkway, St. Louis,
            MO 63108, USA
REFERENCE   4 (bases 1 to 170491)
  AUTHORS   Waterston,R.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-APR-2001) Genome Sequencing Center, Washington
            University School of Medicine, 4444 Forest Park Parkway, St. Louis,
            MO 63108, USA
REFERENCE   5 (bases 1 to 170491)
  AUTHORS   Waterston,R.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-APR-2001) Genome Sequencing Center, Washington
            University School of Medicine, 4444 Forest Park Parkway, St. Louis,
            MO 63108, USA
REFERENCE   6 (bases 1 to 170491)
  AUTHORS   Waterston,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-NOV-2001) Department of Genetics, Washington
            University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA
COMMENT     On Apr 19, 2001 this sequence version replaced gi:10048052.
            Genome Center
            Center: Washington University Genome Sequencing Center
            Center code: WUGSC
            Web site: http://genome.wustl.edu/gsc
NCBI Sequence Viewer

D 1: AC019159. Homo sapiens BAC ...[gi: 136771 16] 






LOCUS AC019159 163085 bp DNA linear PRI 07-NOV-2001 
DEFINITION Homo sapiens BAC clone RP11-56018 from 2, complete sequence. 
ACCESSION AC019159 
VERSION AC019159.8 GI:13677116 
KEYWORDS HTG. 
SOURCE Homo sapiens (human) 
ORGANISM Homo sapiens 
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (bases 1 to 163085) 
AUTHORS Sulston,J.E. and Waterston,R. 
TITLE Toward a complete human genome sequence 
JOURNAL Genome Res. 8 (11), 1097-1108 (1998) 
MEDLINE 99063792 
PUBMED 9847074 
REFERENCE 2 (bases 1 to 163085) 
AUTHORS Goyea,E., Cotton,M., Spalding,L. and Lehnert,L. 
TITLE The sequence of Homo sapiens BAC clone RP11-56018 
JOURNAL Unpublished 
REFERENCE 3 (bases 1 to 163085) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (30-DEC-1999) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 4 (bases 1 to 163085) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (19-APR-2001) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. 
MO 63108, USA 
REFERENCE 5 (bases 1 to 163085) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (20-APR-2001) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. 
MO 63108, USA 
REFERENCE 6 (bases 1 to 163085) 
AUTHORS Waterston,R. 
TITLE Direct Submission 
JOURNAL Submitted (07-NOV-2001) Department of Genetics, Washington 
University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA 
COMMENT On Apr 19, 2001 this sequence version replaced gi:11276269. 
Genome Center 
Center: Washington University Genome Sequencing Center 
Center code: WUGSC 
Web site: http://genome.wustl.edu/gsc 
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□ 1: AC104648. Homo sapiens BAC ...[gi: 18042341] 

LOCUS AC104648 112084 bp DNA linear PRI 21-FEB-2002 
DEFINITION Homo sapiens BAC clone RP11-45D4 from 2, complete sequence. 
ACCESSION AC104648 AC015602 
VERSION AC104648.2 GI:18042341 
KEYWORDS HTG. 
SOURCE Homo sapiens (human) 
ORGANISM Homo sapiens 
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (bases 1 to 112084) 
AUTHORS Sulston,J.E. and Waterston,R. 
TITLE Toward a complete human genome sequence 
JOURNAL Genome Res. 8 (11), 1097-1108 (1998) 
MEDLINE 99063792 
PUBMED 9847074 
REFERENCE 2 (bases 1 to 112084) 
AUTHORS Bielicki,L. and Abbott,A. 
TITLE The sequence of Homo sapiens BAC clone RP11-45D4 
JOURNAL Unpublished (2001) 
REFERENCE 3 (bases 1 to 112084) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (18-DEC-2001) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 4 (bases 1 to 112084) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (03-JAN-2002) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 5 (bases 1 to 112084) 
AUTHORS Waterston,R. 
TITLE Direct Submission 
JOURNAL Submitted (21-FEB-2002) Department of Genetics, Washington 
University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA 
COMMENT On Jan 3, 2002 this sequence version replaced gi:17921241. 

Genome Center 

Center: Washington University Genome Sequencing Center 
Center code: WUGSC 

Web site: http://genome.wustl.edu/gsc 
Contact: sapiens@watson.wustl.edu 

Summary Statistics 

Center project name: H_NH0045D04 
Drafting Center: WIBR 
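
The LOCUS and VERSION lines of the GenBank records reproduced in this exhibit follow a fixed field order, so they can be read mechanically. The following is a minimal sketch in Python, offered for illustration only and not part of the NCBI record. The function names parse_locus and parse_version are hypothetical, and the sketch assumes the eight-token LOCUS layout seen in the records above; it tolerates the stray spaces that scanning introduced around the period and colon of the VERSION line.

```python
import re


def parse_locus(line):
    """Split a LOCUS line into named fields (assumes the 8-token layout of this exhibit)."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: %r" % line)
    return {
        "name": parts[1],
        "length_bp": int(parts[2]),
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


def parse_version(line):
    """Return (accession, version number, GI number) from a VERSION line."""
    match = re.match(r"VERSION\s+(\w+)\s*\.\s*(\d+)\s+GI\s*:\s*(\d+)", line)
    if match is None:
        raise ValueError("unexpected VERSION layout: %r" % line)
    return match.group(1), int(match.group(2)), int(match.group(3))


# Lines copied from the AC104648 record above.
locus = parse_locus("LOCUS AC104648 112084 bp DNA linear PRI 21-FEB-2002")
version = parse_version("VERSION AC104648.2 GI: 18042341")
print(locus["name"], locus["length_bp"], version)
```

A check of this kind would only confirm that the accession, version and length on a printout are internally consistent; it says nothing about the sequence content itself.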



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C 1: AC074362. Homo sapiens BAC ...[gi:14140337] 

LOCUS AC074362 149705 bp DNA linear PRI 07-NOV-2001 
DEFINITION Homo sapiens BAC clone RP11-1C10 from 2, complete sequence. 
ACCESSION AC074362 
VERSION AC074362.5 GI:14140337 
KEYWORDS HTG. 
SOURCE Homo sapiens (human) 
ORGANISM Homo sapiens 
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (bases 1 to 149705) 
AUTHORS Sulston,J.E. and Waterston,R. 
TITLE Toward a complete human genome sequence 
JOURNAL Genome Res. 8 (11), 1097-1108 (1998) 
MEDLINE 99063792 
PUBMED 9847074 
REFERENCE 2 (bases 1 to 149705) 
AUTHORS Belter,E., Abbott,A. and Despot,J. 
TITLE The sequence of Homo sapiens BAC clone RP11-1C10 
JOURNAL Unpublished 
REFERENCE 3 (bases 1 to 149705) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (29-JUL-2000) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 4 (bases 1 to 149705) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (17-MAY-2001) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 5 (bases 1 to 149705) 
AUTHORS Waterston,R. 
TITLE Direct Submission 
JOURNAL Submitted (07-NOV-2001) Department of Genetics, Washington 
University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA 
COMMENT On May 17, 2001 this sequence version replaced gi:13518223. 

Genome Center 

Center: Washington University Genome Sequencing Center 
Center code: WUGSC 

Web site: http://genome.wustl.edu/gsc 
Contact: sapiens@watson.wustl.edu 

Summary Statistics 

Center project name: H_NH0001C10 


NOTICE: This sequence may not represent the entire insert of this 
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C 1: AC079154. Homo sapiens BAC ...[gi: 15778757] 






LOCUS AC079154 175387 bp DNA linear PRI 09-JAN-2002 
DEFINITION Homo sapiens BAC clone RP11-314E14 from 2, complete sequence. 
ACCESSION AC079154 
VERSION AC079154.5 GI:15778757 
KEYWORDS HTG. 
SOURCE Homo sapiens (human) 
ORGANISM Homo sapiens 
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (bases 1 to 175387) 
AUTHORS Sulston,J.E. and Waterston,R. 
TITLE Toward a complete human genome sequence 
JOURNAL Genome Res. 8 (11), 1097-1108 (1998) 
MEDLINE 99063792 
PUBMED 9847074 
REFERENCE 2 (bases 1 to 175387) 
AUTHORS McLellan,M., Cotton,M. and Doebber,A. 
TITLE The sequence of Homo sapiens BAC clone RP11-314E14 
JOURNAL Unpublished (2001) 
REFERENCE 3 (bases 1 to 175387) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (20-AUG-2000) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 4 (bases 1 to 175387) 
AUTHORS Waterston,R.H. 
TITLE Direct Submission 
JOURNAL Submitted (26-SEP-2001) Genome Sequencing Center, Washington 
University School of Medicine, 4444 Forest Park Parkway, St. Louis, 
MO 63108, USA 
REFERENCE 5 (bases 1 to 175387) 
AUTHORS Waterston,R. 
TITLE Direct Submission 
JOURNAL Submitted (09-JAN-2002) Department of Genetics, Washington 
University, 4444 Forest Park Avenue, St. Louis, Missouri 63108, USA 
COMMENT On Sep 26, 2001 this sequence version replaced gi:13654392. 

Genome Center 

Center: Washington University Genome Sequencing Center 
Center code: WUGSC 

Web site: http://genome.wustl.edu/gsc 
Contact: sapiens@watson.wustl.edu 

Summary Statistics 

Center project name: H_NH0314E14 



NOTICE: This sequence may not represent the entire insert of this 
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[57] ABSTRACT 

The present invention provides polynucleotides (kin) which 
identify and encode novel protein kinases (KIN) expressed 
in various human cells and tissues. The present invention 
also provides for antisense sequences and oligonucleotides 
designed from the nucleotide sequences or their comple- 
ments. The invention further provides genetically engi- 
neered expression vectors and host cells for the production 
of purified KIN peptides, antibodies capable of binding KIN, 
and inhibitors which specifically bind KIN. The invention specifically 
provides for diagnostic kits and assays which identify 
a disorder or disease with altered kinase expression and 
allow monitoring of patients during drug therapy. These 
assays utilize oligonucleotides or antibodies produced using 
the kin polynucleotides. 

4 Claims, No Drawings 
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HUMAN KINASE HOMOLOGS 

FIELD OF THE INVENTION 

The present invention is in the field of molecular biology; 
more particularly, the present invention describes nucleic 
acid sequences for novel human kinase homologs. 

BACKGROUND OF THE INVENTION 

Kinases regulate many different cell proliferation, 
differentiation, and signalling processes by adding phos- 
phate groups to proteins. Uncontrolled signalling has been 
implicated in inflammation, oncogenesis, arteriosclerosis, 
and psoriasis. Reversible protein phosphorylation is the 
main strategy for controlling activities of eukaryotic cells. It 
is estimated that more than 1000 of the 10,000 proteins 
active in a typical mammalian cell are phosphorylated. The 
high energy phosphate which drives activation is generally 
transferred from adenosine triphosphate molecules (ATP) to 
a particular protein by protein kinases and removed from 
that protein by protein phosphatases. 

Phosphorylation occurs in response to extracellular sig- 
nals (hormones, neurotransmitters, growth and differentia- 
tion factors, etc), cell cycle checkpoints, and environmental 
or nutritional stresses and is roughly analogous to the 
turning on of a molecular switch. When the switch goes on, the 
appropriate protein kinase activates a metabolic enzyme, 
regulatory protein, receptor, cytoskeletal protein, ion chan- 
nel or pump, or transcription factor. 

The kinases comprise the largest known protein family, a 
superfamily of enzymes with widely varied functions and 
specificities. They are usually named after their substrate, 
their regulatory molecules, after some aspect of a mutant 
phenotype or arbitrarily. Almost all kinases contain a similar 
250-300 amino acid catalytic domain. The N-terminal 
domain, which contains subdomains I-IV, generally folds 
into a two-lobed structure and binds and orients the ATP (or 
GTP) donor molecule. The larger C-terminal lobe, which 
contains subdomains VIA-XI, binds the protein substrate 
and carries out the transfer of the gamma phosphate from 
ATP to the hydroxyl group of a serine, threonine, or tyrosine 
residue. Subdomain V spans the two lobes. 

The kinases may be categorized into families by the 
different amino acid sequences (generally between 5 and 
100 residues) located on either side of, or inserted into loops 
of, the kinase domain. These added amino acid sequences 
allow the regulation of each kinase as it recognizes and 
interacts with its target protein. The primary structure of the 
kinase domains is conserved and can be further subdivided 
into 12 subdomains. The following residues are relatively 
(-95%) invariant: G50 and G52 in subdomain I, K72 in 
subdomain II, G91 in subdomain III, E208 in subdomain VIII, 
D220 and G225 in subdomain IX, and the motifs or patterns 
of amino acids in subdomains VIB, VIII and IX (Hardie G. 
and Hanks S. (1995) The Protein Kinase Facts Books, I and 
II, Academic Press, San Diego, Calif.). 

The cyclin dependent protein kinase (cdk) family includes 
proteins which are turned on and off as the cell proceeds 
through the cell cycle. A cdk is active as a kinase only when 
it is bound to a cyclin. Cdk activation simultaneously 
requires both the addition of a high energy phosphate to a 
threonine residue by a kinase and the removal of a 
covalently-bound phosphate from a specific tyrosine residue 
by a phosphatase. The concentration of some cyclins rises 
gradually through a particular part of the cell cycle until their 
targeted proteolysis ends the coordinated interaction among 
the cyclin, kinase, and phosphatase molecules. 

The second-messenger dependent protein kinases primarily 
mediate the effects of second messengers such as cyclic 
AMP (cAMP), cyclic GMP, inositol triphosphate, 
phosphatidylinositol 3,4,5-triphosphate, cyclic ADP-ribose, 
arachidonic acid and diacylglycerol. For purposes of 
example, the structure and function of cyclic AMP-dependent 
protein kinase (A-kinase) will be described. 
Mammalian cells generally contain at least two forms of 
A-kinase; type 1 which is cytosolic, and type 2 which is 
bound to plasma membrane, nuclear membrane or microtubules. 
In its inactive state, A-kinase consists of a complex of 
two catalytic subunits and two regulatory subunits. When 
each regulatory subunit has bound two molecules of cAMP, 
the catalytic subunit is activated and can transfer a high 
energy phosphate from ATP to the serine or threonine of a 
substrate protein. Substrate proteins are usually marked by 
the presence of two or more basic amino acids on their 
amino terminal sides. A-kinase is important in metabolism 
of glycogen, for inactivation of phosphatase inhibitor 
protein, in transcription of genes which contain a regulatory 
region called the cAMP response element (CRE), and in 
regulation of the ion channels of olfactory neurons. 

Protein kinase C (PKC) is a water-soluble, Ca++-dependent 
kinase, commonly found in brain tissue, which 
moves to the plasma membrane in the presence of Ca++ ions. 
Approximately half of the known isoforms of PKC are 
activated initially by diacylglycerol and phosphatidylserine. 
Prolonged activation of PKC depends on continued production 
of diacylglycerol molecules which are formed when 
phospholipases cleave phosphatidylcholine. In nerve cells, 
PKC phosphorylates ion channels and alters the excitability 
of the cell membrane. In other cells, activation of PKC 
increases gene transcription either by triggering a protein 
kinase cascade which activates a regulatory element (much 
like CRE above) or by phosphorylating and deactivating an 
inhibitor of the regulatory protein. 

Ca++/calmodulin-dependent protein kinases (CaM-kinases) 
mediate most of the actions of Ca++ in human cells. 
The CaM-kinases include enzymes with narrow substrate 
specificity such as myosin light chain kinase which activates 
smooth muscle contraction and phosphorylase kinase which 
activates glycogen breakdown and the multifunctional 
enzyme, CaM-kinase II which is found in all cells. 
Phosphorylase kinase has four subunits: γ is the catalytic moiety 
and α, β and δ are regulatory. Since subunits α and β are 
phosphorylated by A-kinase and subunit δ is Ca++/calmodulin, 
glycogen breakdown can be activated by either 
cAMP or Ca++. 

CaM-kinase II is particularly enriched in catecholamine 
synapses. In those neurons, Ca++ influx stimulates both the 
release of dopamine, noradrenaline or adrenaline and also 
their resynthesis through the activation of CaM-kinase II. 
Although the main role of CaM-kinase II is phosphorylation 
of tyrosine hydroxylase, the rate-limiting enzyme of 
catecholamine synthesis, CaM-kinase II also autophosphorylates 
and remains active until phosphatases overwhelm it. 

Transmembrane protein-tyrosine kinases are receptors for 
most growth factors. The first characterized receptor for 
epidermal growth factor (EGF) is a single pass transmembrane 
protein of about 1200 amino acids with an extracellular 
glycosylated portion that interacts with the 53 amino 
acid EGF molecule. Binding activates the transfer of a 
phosphate group from ATP to selected tyrosine side chains 
of the receptor and other specific proteins. Other protein 
receptors with similar structure include the following growth 
and differentiation factors (GF): platelet derived GF, 
fibroblast GF, hepatocyte GF, insulin and insulin-like GFs, nerve 
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GF, vascular endothelial GF, macrophage colony stimulating 
factor, etc. Each protein phosphorylates itself by receptor 
dimerization to initiate the intracellular signalling cascade. 

Many protein-tyrosine kinases lack transmembrane 
regions and form a complex with the intracellular regions of 
other cell surface receptors. The best known of these 
non-receptor protein-tyrosine kinases (NR-PTKs) are 
the Src kinase family (Src, Yes, Fgr, Fyn, Lck, Lyn, Hck, 
Blk, etc) and the Janus kinase family (Jak1, Jak2, Jak3, 
Tyk2, etc). The Src PTKs are located on the cytoplasmic side 
of the plasma membrane and are characterized by Src 
homology regions 2 and 3 (SH2 and SH3). Src PTKs 
recognize short peptide motifs bearing phosphotyrosine or 
proline residues, respectively, and mediate protein-protein 
interactions that regulate a whole range of intracellular 
signalling molecules. Janus PTKs contain PTK or PTK-like 
domains and interact with growth hormone, prolactin, and 
some of the same cytokine receptors as Src PTKs. The 
cytokine receptors are unique both in their ability to recruit 
multiple PTKs and in the diversity of their intracellular 
domains which allow flexibility in their responses within 
different cell types (Taniguchi T. (1995) Science 
268:251-55). Src and Jak kinases were first identified as the 
products of mutant oncogenes in cancer cells where their 
activation was no longer subject to normal cellular controls. 

Extracellular signalling proteins such as transforming 
growth factor-β (TGF-β), activins, bone morphogenetic 
protein, and related members of the TGF-β superfamily 
interact with receptor serine/threonine kinases. Like EGF 
above, these receptor kinases have a single pass transmembrane 
domain with a serine/threonine kinase residue on the 
cytosolic side of the plasma membrane. The signalling 
pathways which are activated by binding the extracellular 
signalling molecules are presently under investigation. 

Mitogen-activated protein (MAP) kinases also regulate 
intracellular signalling pathways. They mediate signal transduction 
from cell surface to nuclei via phosphorylation 
cascades. Several subgroups have been identified, and each 
manifests different substrate specificities and responds to 
distinct extracellular stimuli (Egan S. E. and Weinberg R. A. 
(1993) Nature 365:781-783). 

MAP kinase signalling pathways are present in mammalian 
cells as well as in yeast. The extracellular stimuli which 
activate mammalian pathways include epidermal growth 
factor (EGF), ultraviolet light, hyperosmolar medium, heat 
shock, endotoxic lipopolysaccharide (LPS), and pro-inflammatory 
cytokines such as tumor necrosis factor (TNF) 
and interleukin-1 (IL-1). In Saccharomyces cerevisiae, 
exposure to mating pheromone or hyperosmolar environments 
activates the various MAP kinase signalling pathways. 

Mammalian cells have at least three subgroups of MAP 
kinases (Derijard B. et al (1995) Science 267:682-5), each 
distinguished by a tripeptide motif. They are extracellular 
signal-regulated protein kinases (ERK) characterized by 
Thr-Glu-Tyr; c-Jun amino-terminal kinases (JNK) characterized 
by Thr-Pro-Tyr; and p38 kinase characterized by 
Thr-Gly-Tyr. Each subgroup is activated by dual phosphorylation 
of threonine and tyrosine residues by MAP kinase 
kinases located upstream of the phosphorylation cascade. 
Activated MAP kinases, in turn, phosphorylate downstream 
effectors ultimately leading to intracellular changes. 
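
The three tripeptide motifs named above lend themselves to a simple lookup. The following is a minimal sketch in Python, not taken from the specification: the function name find_map_kinase_motifs and the two peptide fragments are hypothetical and were invented for illustration. The only facts drawn from the text are the motif assignments (Thr-Glu-Tyr for ERK, Thr-Pro-Tyr for JNK, Thr-Gly-Tyr for p38), written here in single-letter amino acid code.

```python
# Single-letter codes for the tripeptide motifs named in the text:
# Thr-Glu-Tyr = TEY, Thr-Pro-Tyr = TPY, Thr-Gly-Tyr = TGY.
MOTIF_TO_SUBGROUP = {"TEY": "ERK", "TPY": "JNK", "TGY": "p38"}


def find_map_kinase_motifs(peptide):
    """Return (0-based position, motif, subgroup) for each listed motif found."""
    peptide = peptide.upper()
    hits = []
    for start in range(len(peptide) - 2):
        tripeptide = peptide[start:start + 3]
        if tripeptide in MOTIF_TO_SUBGROUP:
            hits.append((start, tripeptide, MOTIF_TO_SUBGROUP[tripeptide]))
    return hits


# Hypothetical peptide fragments, invented for illustration only.
print(find_map_kinase_motifs("GFLTEYVAT"))
print(find_map_kinase_motifs("SFMMTPYVVT"))
```

A motif hit of this kind is suggestive at most; three-residue strings occur by chance in unrelated proteins, so the sketch does not by itself assign a sequence to a MAP kinase subgroup.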

The ERK signal transduction pathway is activated via 
tyrosine kinase receptors on the plasmalemma. When 
growth factors bind to these receptors, the receptors bind to noncatalytic 
Src homology (SH) adaptor proteins (SH2-SH3-SH2) and a 
guanine nucleotide releasing protein (GNRP). GNRP 
reduces GTP and activates Ras proteins, members of the 
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large family of guanine nucleotide binding proteins 
(G-proteins). Activated Ras proteins bind to a protein kinase, 
c-Raf-1, and activate the Raf-1 proteins. The activated Raf-1 
kinase subsequently phosphorylates MAP kinase kinase 
(MKK) which, in turn, activates ERKs. 

ERKs are proline-directed protein kinases which phosphorylate 
Ser/Thr-Pro motifs. In fact, cytoplasmic phospholipase 
A2 (cPLA2) and transcription factor Elk-1 are substrates 
of ERKs. The ERKs phosphorylate Ser505 of cPLA2 
thereby increasing its enzymatic activity and resulting in 
release of arachidonic acid and the formation of lysophospholipids 
from membrane phospholipids. Likewise, phosphorylation 
of the transcription factor Elk-1 by ERK ultimately 
increases transcriptional activity. 
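
Because ERKs are described as proline-directed kinases acting on Ser/Thr-Pro motifs, candidate sites in a protein sequence can be listed with a short pattern search. The following is a minimal sketch in Python under stated assumptions: the function name find_ser_thr_pro_sites and the example peptide are hypothetical, and the search reports every serine or threonine immediately followed by proline without regard to whether the site is actually phosphorylated.

```python
import re


def find_ser_thr_pro_sites(protein):
    """Return (1-based position, residue) for each Ser or Thr directly followed by Pro."""
    return [
        (match.start() + 1, match.group(0)[0])
        for match in re.finditer(r"[ST]P", protein.upper())
    ]


# Hypothetical nine-residue peptide, invented for illustration only.
print(find_ser_thr_pro_sites("MASPLTPKS"))
```

The final serine in the example is not reported because no proline follows it.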

JNK is distantly related to the ERK and is similarly 
activated by dual phosphorylation of Thr and Tyr and by 
MKK4 (Davis R (1994) TIBS 19:470-473). The JNK signal 
transduction pathway is also initiated by ultraviolet light, 
osmotic stress, and the pro-inflammatory cytokines, TNF 
and IL-1. Phosphorylation of Ser63 and Ser73 in the NH2-terminal 
domain of the transcription factor c-Jun increases 
transcriptional activity. 

p38 is a 41 kD protein containing 360 amino acids. Its 
dual phosphorylation is activated by the MKK3 and MKK4, 
heat shock, hyperosmolar medium, IL-1 or LPS endotoxin 
(Han J. et al (1994) Science 265:808-811). Sepsis produced 
by LPS is characterized by fever, chills, tachypnea, and 
tachycardia, and severe cases may result in septic shock 
which includes hypotension and multiple organ failure. 

Cells respond to LPS as a stress signal because it alters 
normal cellular processes and induces the release of sys- 
temic mediators such as TNF. CD14 is a 
glycosylphosphatidyl-inositol-anchored membrane glyco- 
protein which serves as a LPS receptor on the plasmalemma 
of monocytic cells. The binding of LPS to CD14 causes 
rapid protein tyrosine phosphorylation of the 44- and 42- or 
40-kD isoforms of MAP kinases. Although they bind LPS, 
these MAP kinase isoforms do not appear to belong to the 
p38 subgroup. 

A detailed understanding of kinase pathways and signal 
transduction is beginning to reveal some mechanisms for 
interceding in the progression of inflammatory illnesses and 
of uncontrolled cell proliferation. The cDNAs, 
oligonucleotides, peptides and antibodies for the human 
kinases, which are the subject of this invention and are listed 
in Table 1, provide a plurality of tools for studying signalling 
cascades in various cells and tissues and for diagnosing and 
selecting inhibitors or drugs with the potential to intervene 
in various disorders or diseases in which altered kinase 
expression is implicated. The disorders or diseases include, 
but are not limited to, human X-linked agammaglobulinemia, 
nonspherocytic hemolytic anemia, atherosclerosis, carcinomas 
(breast, ovary, renal, squamous cell and prostate), 
diabetes, gliomas, glomerular disease, hepatomegaly, Kaposi's 
sarcoma, lymphoblastic and myelogenous leukemias, 
myoglobinuria, peptic ulcer disease, psoriasis, pulmonary 
fibrosis, restenosis, and septic shock due to cholera, 
Clostridium difficile, E. coli and Shigella (Isselbacher K. J. 
et al (1994) Harrison's Principles of Internal Medicine, 
McGraw-Hill, New York City; Levitzki A. and A. Gazit 
(1995) Science 267:1782-88). 

SUMMARY OF THE INVENTION 

The subject invention provides unique polynucleotides 
(SEQ ID NOs 1-44) which have been identified as novel 
human kinases (kin). These partial cDNAs were identified 
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among the polynucleotides which comprise various Incyte 
cDNA libraries. 

The invention comprises polynucleotides which are 
complementary to the kin sequences (SEQ ID NOs 1-44). 

The invention also comprises the use of kin sequences to 
identify and obtain full length human kinase cDNAs such 
as SEQ ID NO 45. 

The invention further comprises the use of oligomers 
from these kin sequences in a kinases kit which can be used 
to identify a disorder or disease with altered kinase expres- 
sion and provide a method for monitoring progress of a 
patient during drug therapy. 

Aspects of the invention include use of kin sequences or 
recombinant nucleic acids derived from them to produce 
purified peptides. Still further aspects of the invention use 
these purified peptides to identify antibodies or other molecules 
with inhibitory activity toward a particular kinase, 
group of kinases or disease. 

In addition, the invention comprises the use of kin specific 
antibodies in assays to identify a disorder or disease with 
altered kinase expression and provides a method to monitor 
the progress of a patient during drug therapy. 

DESCRIPTION OF THE FIGURE 

FIGS. 1A and 1B display the full length nucleotide 
sequence for human MAP kinase from stomach tissue (SEQ 
ID NO 45; Incyte Clone 214915E) and its predicted amino 
acid sequence. 

DETAILED DESCRIPTION OF THE 
INVENTION 

Definitions 

As used herein, the abbreviation for kinase in lower case 
(kin) refers to a gene, cDNA, RNA or nucleic acid sequence 
while the upper case version (KIN) refers to a protein, 
polypeptide, peptide, oligopeptide, or amino acid sequence. 

An "oligonucleotide" or "oligomer" is a stretch of nucleotide 
residues which has a sufficient number of bases to be 
used in a polymerase chain reaction (PCR). These short 
sequences are based on (or designed from) genomic or 
cDNA sequences and are used to amplify, confirm, or reveal 
the presence of an identical, similar or complementary DNA 
or RNA in a particular cell or tissue. Oligonucleotides or 
oligomers comprise portions of a DNA sequence having at 
least about 10 nucleotides and as many as about 50 
nucleotides, preferably about 15 to 30 nucleotides. They are 
chemically synthesized and may be used as probes. 

"Probes" are nucleic acid sequences of variable length, 
preferably between at least about 10 and as many as about 
6,000 nucleotides, depending on use. They are used in the 
detection of identical, similar, or complementary nucleic 
acid sequences. Longer length probes are usually obtained 
from a natural or recombinant source, are highly specific and 
much slower to hybridize than oligomers. They may be 
single- or double-stranded and carefully designed to have 
specificity in PCR, hybridization membrane-based, or 
ELISA-like technologies. 

"Reporter" molecules are chemical moieties used for 
labelling a nucleic or amino acid sequence. They include, 
but are not limited to, radionuclides, enzymes, fluorescent, 
chemi-luminescent, or chromogenic agents. Reporter molecules 
associate with, establish the presence of, and may 
allow quantification of a particular nucleic or amino acid 
sequence. 

A "portion" or "fragment" of a polynucleotide or nucleic 
acid comprises all or any part of the nucleotide sequence 
having fewer nucleotides than about 6 kb, preferably fewer 
than about 1 kb which can be used as a probe. Such probes 
may be labelled with reporter molecules using nick 
translation, Klenow fill-in reaction, PCR or other methods 
well known in the art. After pretesting to optimize reaction 
conditions and to eliminate false positives, nucleic acid 
probes may be used in Southern, northern or in situ hybridizations 
to determine whether DNA or RNA encoding the 
protein is present in a biological sample, cell type, tissue, 
organ or organism. 

"Recombinant nucleotide variants" are polynucleotides 
which encode a protein. They may be synthesized by making 
use of the "redundancy" in the genetic code. Various codon 
substitutions, such as the silent changes which produce 
specific restriction sites or codon usage-specific mutations, 
may be introduced to optimize cloning into a plasmid or 
viral vector or expression in a particular prokaryotic or 
eukaryotic host system, respectively. 

"Linkers" are synthesized palindromic nucleotide 
sequences which create internal restriction endonuclease 
sites for ease of cloning the genetic material of choice into 
various vectors. "Polylinkers" are engineered to include 
multiple restriction enzyme sites and provide for the use of 
both those enzymes which leave 5' and 3' overhangs such as 
BamHI, EcoRI, PstI, KpnI and Hind III or which provide a 
blunt end such as EcoRV, SnaBI and StuI. 

"Control elements" or "regulatory sequences" are those 
nontranslated regions of the gene or DNA such as enhancers, 
promoters, introns and 3' untranslated regions which interact 
with cellular proteins to carry out replication, transcription, 
and translation. They may occur as boundary sequences or 
even split the gene. They function at the molecular level and 
along with regulatory genes are very important in 
development, growth, differentiation and aging processes. 

"Chimeric" molecules are polynucleotides or polypeptides 
which are created by combining one or more of 
nucleotide sequences of this invention (or their parts) with 
additional nucleic acid sequence(s). Such combined 
sequences may be introduced into an appropriate vector and 
expressed to give rise to a chimeric polypeptide which may 
be expected to be different from the native molecule in one 
or more of the following kinase characteristics: cellular 
location, distribution, ligand-binding affinities, interchain 
affinities, degradation/turnover rate, signalling, etc. 

"Active" is that state which is capable of being useful or 
of carrying out some role. It specifically refers to those 
forms, fragments, or domains of an amino acid sequence 
which display the biologic and/or immunogenic activity 
characteristic of the naturally occurring kinase. 

"Naturally occurring KIN" refers to a polypeptide produced 
by cells which have not been genetically engineered 
or which have been genetically engineered to produce the 
same sequence as that naturally produced. Specifically 
contemplated are various polypeptides which arise from 
post-translational modifications. Such modifications of the 
polypeptide include but are not limited to acetylation, 
carboxylation, glycosylation, phosphorylation, lipidation 
and acylation. 

"Derivative" refers to those polypeptides which have been 
chemically modified by such techniques as ubiquitination, 
labelling (see above), pegylation (derivatization with 
polyethylene glycol), and chemical insertion or substitution of 
amino acids such as ornithine which do not normally occur 
in human proteins. 

"Recombinant polypeptide variant" refers to any polypeptide 
which differs from naturally occurring KIN by amino 
acid insertions, deletions and/or substitutions, created using 
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recombinant DNA techniques. Guidance in determining 
which amino acid residues may be replaced, added or 
deleted without abolishing characteristics of interest may be 
found by comparing the sequence of KIN with that of related 
polypeptides and minimizing the number of amino acid 
sequence changes made in highly conserved regions. 

Amino acid "substitutions" are defined as one for one 
amino acid replacements. They are conservative in nature 
when the substituted amino acid has similar structural and/or 
chemical properties. Examples of conservative replacements 
are substitution of a leucine with an isoleucine or valine, an 
aspartate with a glutamate, or a threonine with a serine. 
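
The definition of conservative replacement can be illustrated with a minimal sketch. The grouping below is built only from the three examples named in the text; treating each example as a symmetric group (so that isoleucine to valine also counts) is an assumption made for illustration, and the function names are hypothetical.

```python
# Groups taken from the examples in the text; illustrative, not exhaustive.
CONSERVATIVE_GROUPS = [
    {"L", "I", "V"},  # leucine, isoleucine, valine
    {"D", "E"},       # aspartate, glutamate
    {"T", "S"},       # threonine, serine
]


def is_conservative(original: str, replacement: str) -> bool:
    """Return True if a one-for-one replacement stays within one group."""
    if original == replacement:
        return False  # identical residues are not a substitution
    return any(original in g and replacement in g for g in CONSERVATIVE_GROUPS)


def substitutions(seq_a: str, seq_b: str):
    """List (position, original, replacement, conservative) for each difference.

    Positions are 1-based. Sequences must have equal length, since
    substitutions are one-for-one replacements.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("one-for-one substitutions require equal lengths")
    return [
        (i, a, b, is_conservative(a, b))
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]
```

Calling `substitutions("MLDT", "MIDS")` would report the changes at positions 2 and 4, both flagged as conservative under these assumed groups.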

Amino acid "insertions" or "deletions" are changes to or 
within an amino acid sequence. They typically fall in the 
range of about 1 to 5 amino acids. The variation allowed in 
a particular amino acid sequence may be experimentally 
determined by producing the peptide synthetically or by 
systematically making insertions, deletions, or substitutions 
of nucleotides in the kin sequence using recombinant DNA 
techniques. 

A "signal or leader sequence" is a short amino acid 
sequence which is or can be used, when desired, to direct the
polypeptide through a membrane of a cell. Such a sequence 
may be naturally present on the polypeptides of the present 
invention or provided from heterologous sources by recom- 
binant DNA techniques. 

An "oligopeptide" is a short stretch of amino acid residues 
and may be expressed from an oligonucleotide. It may be 
functionally equivalent to and either the same length as or 
considerably shorter than a "fragment", "portion", or
"segment" of a polypeptide. Such sequences comprise a 
stretch of amino acid residues of at least about 5 amino acids 
and often about 17 or more amino acids, typically at least 
about 9 to 13 amino acids, and of sufficient length to display 
biologic and/or immunogenic activity. 

An "inhibitor" is a substance which retards or prevents a 
chemical or physiological reaction or response. Common 
inhibitors include but are not limited to antisense molecules, 
antibodies, antagonists and their derivatives. 

A "standard" is a quantitative or qualitative measurement 
for comparison. Preferably, it is based on a statistically 
appropriate number of samples and is created to use as a 
basis of comparison when performing diagnostic assays, 
running clinical trials, or following patient treatment pro- 
files. The samples of a particular standard may be normal or 
similarly abnormal. 

"Animal" as used herein may be defined to include 
human, domestic (cats, dogs, etc), agricultural (cows, 
horses, sheep, goats, chicken, fish, etc) or test species (frogs, 
mice, rats, rabbits, simians, etc). 

"Disorders or diseases" in which altered kinase activity 
has been implicated specifically include, but are not limited
to, human X-linked agammaglobulinemia, nonspherocytic 
hemolytic anemia, atherosclerosis, carcinomas (breast, 
ovary, renal, squamous cell and prostate), diabetes, gliomas, 
glomerular disease, hepatomegaly, Kaposi's sarcoma,
lymphoblastic and myelogenous leukemias, myoglobinuria,
peptic ulcer disease, psoriasis, pulmonary fibrosis, 
restenosis, and septic shock due to cholera, Clostridium 
difficile, E. coli and Shigella. 

Since the list of technical and scientific terms cannot be all 
encompassing, any undefined terms shall be construed to 
have the same meaning as is commonly understood by one 
of skill in the art to which this invention belongs. 
Furthermore, the singular forms "a", "an" and "the" include 
plural referents unless the context clearly dictates otherwise. 
For example, reference to a "restriction enzyme" or a "high
fidelity enzyme" may include mixtures of such enzymes and
any other enzymes fitting the stated criteria, or reference to
the method includes reference to one or more methods for
obtaining cDNA sequences which will be known to those
skilled in the art or will become known to them upon reading
this specification. 

Before the present sequences, variants, formulations and 
methods for making and using the invention are described, 
it is to be understood that the invention is not to be limited
only to the particular sequences, variants, formulations or
methods described. The sequences, variants, formulations 
and methodologies may vary, and the terminology used 
herein is for the purpose of describing particular
embodiments. The terminology and definitions are not intended to
be limiting since the scope of protection will ultimately
depend upon the claims. 

DESCRIPTION OF THE INVENTION 

The present invention provides for purified partial protein 
kinase cDNAs which were expressed in various human 
tissues and isolated therefrom. These sequences were iden- 
tified by their similarity to published or known open reading 
frames or untranslated control regions. Since protein kinases 
are associated with basic cellular processes such as cell 
proliferation, differentiation and cell signalling, these nucle- 
otide sequences are useful in the characterization of and 
delineation of normal and abnormal processes. Kinase 
nucleotide sequences are useful in diagnostic assays used to 
evaluate the role of a specific kinase in normal, diseased, or 
therapeutically treated cells. 

Purified kinase nucleotide sequences have numerous 
applications in techniques known to those skilled in the art 
of molecular biology. These techniques include their use as 
hybridization probes, for chromosome and gene mapping, in 
PCR technologies, in the production of sense or antisense 
nucleic acids, in screening for new therapeutic molecules, 
etc. These examples are well known and are not intended to 
be limiting. Furthermore, the nucleotide sequences disclosed 
herein may be used in molecular biology techniques that 
have not yet been developed, provided the new techniques 
rely on properties of nucleotide sequences that are currently 
known, including but not limited to such properties as the 
triplet genetic code and specific base pair interactions. 

As a result of the degeneracy of the genetic code, a
multitude of kinase-encoding nucleotide sequences may be
produced and some of these will bear only minimal homology
to the endogenous sequence of any known and naturally
occurring kinase. This invention has specifically contemplated
each and every possible variation of nucleotide
sequence that could be made by selecting combinations
based on possible codon choices. These combinations are
made in accordance with the standard triplet genetic code as
applied to the nucleotide sequence of naturally occurring
kinases, and all such variations are to be considered as being
specifically disclosed.
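
The size of the "multitude" follows directly from the standard triplet genetic code. The sketch below is an illustration, not part of the specification: it builds the standard code table, counts how many nucleotide sequences encode a given peptide, and enumerates them for short peptides. The function names are hypothetical.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases T, C, A, G
# (third base varying fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(peptide: str) -> int:
    """Number of distinct nucleotide sequences encoding the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)


def all_encodings(peptide: str):
    """Yield every nucleotide sequence encoding the peptide (short inputs only)."""
    for codons in product(*(SYNONYMS[aa] for aa in peptide)):
        yield "".join(codons)


def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
```

The count grows multiplicatively with peptide length, which is why enumeration is practical only for short stretches while counting remains cheap.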

Although the kinase nucleotide sequences and their
derivatives or variants are preferably capable of identifying
the nucleotide sequence of the naturally occurring kinase
under optimized conditions, it may be advantageous to
produce kinase-encoding nucleotide sequences possessing a
substantially different codon usage. Codons can be selected
to increase the rate at which expression of the peptide occurs
in a particular prokaryotic or eukaryotic expression host in
accordance with the frequency with which particular codons
are utilized by the host. Other reasons for substantially
altering the nucleotide sequence encoding the kinase without
altering the encoded amino acid sequence include the
production of RNA transcripts having more desirable
properties, such as a longer half-life, than transcripts
produced from the naturally occurring sequence.

Nucleotide sequences encoding a kinase may be joined to 
a variety of other nucleotide sequences by means of well 
established recombinant DNA techniques (Sambrook J. et al
(1989) Molecular Cloning: A Laboratory Manual, Cold 
Spring Harbor Laboratory, Cold Spring Harbor, N.Y; or 
Ausubel F. M. et al (1989) Current Protocols in Molecular 
Biology, John Wiley & Sons, New York City). Useful 
sequences for joining to the kinase include an assortment of 
cloning vectors such as plasmids, cosmids, lambda phage 
derivatives, phagemids, and the like. Vectors of interest 
include vectors for replication, expression, probe generation, 
sequencing, and the like. In general, vectors of interest may 
contain an origin of replication functional in at least one 
organism, convenient restriction endonuclease sensitive 
sites, and selectable markers for one or more host cell 
systems. 

PCR as described in U.S. Pat. Nos. 4,683,195; 4,800,195; 
and 4,965,188 provides additional uses for oligonucleotides 
based upon the kinase nucleotide sequence. Such oligomers 
are generally chemically synthesized, but they may be of 
recombinant origin or a mixture of both. Oligomers generally
comprise two nucleotide sequences, one with sense
orientation (5'->3') and one with antisense (3' to 5')
employed under optimized conditions for identification of a 
specific gene or diagnostic use. The same two oligomers, 
nested sets of oligomers, or even a degenerate pool of 
oligomers may be employed under less stringent conditions 
for identification and/or quantitation of closely related DNA 
or RNA sequences. 
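
The relationship between the sense and antisense oligomers can be sketched in a few lines. This is a minimal illustration, assuming a known double-stranded template written as its sense strand; the helper names and the default primer length are hypothetical, and real primer design would also weigh melting temperature and specificity.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_pair(template: str, length: int = 20):
    """Return (sense, antisense) primers flanking the template.

    The sense primer copies the first `length` bases of the template.
    The antisense primer is the reverse complement of the last `length`
    bases, so both primers are written 5'->3'.
    """
    if len(template) < 2 * length:
        raise ValueError("template too short for two non-overlapping primers")
    sense = template[:length]
    antisense = reverse_complement(template[-length:])
    return sense, antisense
```

For example, `primer_pair("ATGCATGCAA", 4)` takes `ATGC` from the start of the template and the reverse complement of `GCAA` from its end.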

Full length genes may be cloned utilizing partial nucle- 
otide sequence and various methods known in the art. 
Gobinda et al (1993; PCR Methods Applic 2:318-22) dis- 
close "restriction-site PCR" as a direct method which uses 
universal primers to retrieve unknown sequence adjacent to 
a known locus. First, genomic DNA is amplified in the 
presence of primer to linker and a primer specific to the 
known region. The amplified sequences are subjected to a 
second round of PCR with the same linker primer and 
another specific primer internal to the first one. Products of 
each round of PCR are transcribed with an appropriate RNA 
polymerase and sequenced using reverse transcriptase. 
Gobinda et al present data concerning Factor IX for which
they identified a conserved stretch of 20 nucleotides in the 
3' noncoding region of the gene. 

Inverse PCR is the first method to report successful 
acquisition of unknown sequences starting with primers 
based on a known region (Triglia T. et al (1988) Nucleic 
Acids Res 16:8186). The method uses several restriction 
enzymes to generate a suitable fragment in the known region 
of a gene. The fragment is then circularized by intramolecu- 
lar ligation and used as a PCR template. Divergent primers 
are designed from the known region. The multiple rounds of 
restriction enzyme digestions and ligations that are neces- 
sary prior to PCR make the procedure slow and expensive 
(Gobinda et al, supra). 

Capture PCR (Lagerstrom M. et al (1991) PCR Methods 
Applic 1:111-19) is a method for PCR amplification of DNA 
fragments adjacent to a known sequence in human and YAC 
DNA. As noted by Gobinda et al (supra), capture PCR also 
requires multiple restriction enzyme digestions and ligations 
to place an engineered double-stranded sequence into an 
unknown portion of the DNA molecule before PCR.
Although the restriction and ligation reactions are carried
out simultaneously, the requirements for extension, immo- 
bilization and two rounds of PCR and purification prior to 
sequencing render the method cumbersome and time con- 
suming. 

Parker J. D. et al (1991; Nucleic Acids Res 19:3055-60), 
teach walking PCR, a method for targeted gene walking 
which permits retrieval of unknown sequence.
PromoterFinder™ is a new kit available from Clontech (Palo Alto,
Calif.) which uses PCR and primers derived from p53 to
walk in genomic DNA. Nested primers and special
PromoterFinder libraries are used to detect upstream sequences
such as promoters and regulatory elements. This process 
avoids the need to screen libraries and is useful in finding 
intron/exon junctions. 

Another new PCR method, "Improved Method for 
Obtaining Full Length cDNA Sequences" by Guegler et al, 
patent application Ser. No. 08/487,112, filed Jun. 7, 1995 and
hereby incorporated by reference, employs XL-PCR 
(Perkin-Elmer, Foster City, Calif.) to amplify and extend 
partial nucleotide sequence into longer pieces of DNA. This 
method was developed to allow a single researcher to 
process multiple genes (up to 20 or more) at one time and to 
obtain an extended (possibly full-length) sequence within 
6-10 days. This new method replaces methods which use 
labelled probes to screen plasmid libraries and allow one 
researcher to process only about 3-5 genes in 14-40 days. 

In the first step, which can be performed in about two 
days, any two of a plurality of primers are designed and 
synthesized based on a known partial sequence. In step 2, 
which takes about six to eight hours, the sequence is 
extended by PCR amplification of a selected library. Steps 3 
and 4, which take about one day, are purification of the 
amplified cDNA and its ligation into an appropriate vector. 
Step 5, which takes about one day, involves transforming 
and growing up host bacteria. In step 6, which takes approxi- 
mately five hours, PCR is used to screen bacterial clones for 
extended sequence. The final steps, which take about one 
day, involve the preparation and sequencing of selected 
clones. 

If the full length cDNA has not been obtained, the entire 
procedure is repeated using either the original library or 
some other preferred library. The preferred library may be 
one that has been size-selected to include only larger cDNAs 
or may consist of single or combined commercially avail- 
able libraries, eg. lung, liver, heart and brain from Gibco/ 
BRL (Gaithersburg, Md.). The cDNA library may have been 
prepared with oligo (dT) or random priming. Random 
primed libraries are preferred in that they will contain more 
sequences which contain 5' ends of genes. A randomly 
primed library may be particularly useful if an oligo (dT) 
library does not yield a complete gene. It must be noted that 
the larger and more complex the protein, the less likely it is 
that the complete gene will be found in a single plasmid. 

A new method for analyzing either the size or the nucleotide
sequence of PCR products is capillary electrophoresis.
Systems for rapid sequencing are available from Perkin
Elmer (Foster City, Calif.), Beckman Instruments (Fullerton,
Calif.), and other companies. Capillary sequencing employs
flowable polymers for electrophoretic separation, four
different fluorescent dyes (one for each nucleotide) which are
laser activated, and detection of the emitted wavelengths by
a charge coupled device camera. Output/light intensity is
converted to electrical signal using appropriate software (eg.
Genotyper™ and Sequence Navigator™ from Perkin
Elmer) and the entire process from loading of samples to
computer analysis and electronic data display is computer
controlled. Capillary electrophoresis provides greater
resolution and is many times faster than standard gel based
procedures. It is particularly suited to the sequencing of
small pieces of DNA which might be present in limited
amounts in a particular sample. The reproducible sequencing
of up to 350 bp of M13 phage DNA in 30 min has been
reported (Ruiz-Martinez M. C. et al (1993) Anal Chem
65:2851-8).

Another aspect of the subject invention is to provide for
kinase hybridization probes which are capable of hybridizing
with naturally occurring nucleotide sequences encoding
kinases. The stringency of the hybridization conditions will
determine whether the probe identifies only the native
nucleotide sequence of that specific kinase or sequences of
closely related molecules. If degenerate kinase nucleotide
sequences of the subject invention are used for the detection
of related kinase encoding sequences, they should preferably
contain at least 50% of the nucleotides of the sequences
presented herein. Hybridization probes of the subject
invention may be derived from the nucleotide sequences of the
SEQ ID NOs 1-44, or from surrounding or included
genomic sequences comprising untranslated regions such as
promoters, enhancers and introns. Such hybridization probes
may be labelled with appropriate reporter molecules. Means
for producing specific hybridization probes for kinases
include oligolabelling, nick translation, end-labelling or
PCR amplification using a labelled nucleotide. Alternatively,
the cDNA sequence may be cloned into a vector for the
production of mRNA probe. Such vectors are known in the
art, are commercially available, and may be used to synthesize
RNA probes in vitro by addition of an appropriate RNA
polymerase such as T7, T3 or SP6 and labelled nucleotides.
A number of companies (such as Pharmacia Biotech,
Piscataway, N.J.; Promega, Madison, Wis.; US Biochemical
Corp, Cleveland, Ohio; etc.) supply commercial kits and
protocols for these procedures.

It is also possible to produce a DNA sequence, or portions
thereof, entirely by synthetic chemistry. Sometimes the
source of information for producing this sequence comes
from the known homologous sequence from closely related
organisms. After synthesis, the nucleic acid sequence can be
used alone or joined with a preexisting sequence and
inserted into one of the many available DNA vectors and
their respective host cells using techniques well known in
the art. Moreover, synthetic chemistry may be used to
introduce specific mutations into the nucleotide sequence.
Alternatively, a portion of sequence in which a mutation is
desired can be synthesized and recombined with a portion of
an existing genomic or recombinant sequence.

The kinase nucleotide sequences can be used individually,
or in panels, in a diagnostic test or assay to detect disorder
or disease processes associated with abnormal levels of
kinase expression. The nucleotide sequence is added to a
sample (fluid, cell or tissue) from a patient under hybridizing
conditions. After an incubation period, the sample is washed
with a compatible fluid which optionally contains a reporter
molecule which will bind the specific nucleotide. After the
compatible fluid is rinsed off, the reporter molecule is
quantitated and compared with a standard for that fluid, cell
or tissue. If kinase expression is significantly different from
the standard, the assay indicates the presence of disorder or
disease. The form of such qualitative or quantitative methods
may include northern analysis, dot blot or other membrane
based technologies, dip stick, pin or chip technologies,
PCR, ELISAs or other multiple sample format technologies.
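
The comparison of a quantitated reporter signal with a standard can be sketched as follows. The text does not say how "significantly different" is decided; the z-score test and the cutoff of two standard deviations used here are assumptions chosen only for illustration, and the function name is hypothetical.

```python
from statistics import mean, stdev


def deviates_from_standard(sample_signal, standard_signals, z_cutoff=2.0):
    """Compare one reporter signal with a set of standard measurements.

    Returns (flagged, z), where z is the number of sample standard
    deviations separating the signal from the mean of the standard.
    The z_cutoff default is an illustrative assumption.
    """
    mu = mean(standard_signals)
    sigma = stdev(standard_signals)  # requires at least two standard values
    if sigma == 0:
        raise ValueError("standard has no spread; z-score is undefined")
    z = (sample_signal - mu) / sigma
    return abs(z) > z_cutoff, z
```

The same function could be applied to successive treatment profiles to see whether the signal moves back toward the standard over time.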

This same assay, combining a sample with the nucleotide 
sequence, is applicable in evaluating the efficacy of a
particular therapeutic treatment regime. It may be used in
animal studies, in clinical trials, or in monitoring the treat- 
ment of an individual patient. First, standard expression 
must be established for use as a basis of comparison. 
Second, samples from the animals or patients affected by the 
disorder or disease are combined with the nucleotide 
sequence to evaluate the deviation from the standard or 
normal profile. Third, an existing therapeutic agent is 
administered, and a treatment profile is generated. The assay 
is evaluated to determine whether the profile progresses 
toward or returns to the standard pattern. Successive treat- 
ment profiles may be used to show the efficacy of treatment 
over a period of several days or several months. 

The nucleotide sequence for any particular kinase (SEQ 
ID NOs 1-45) can also be used to generate probes for 
mapping the native genomic sequence. The sequence may be 
mapped to a particular chromosome or to a specific region 
of the chromosome using well known techniques. These
include in situ hybridization to chromosomal spreads 
(Verma et al (1988) Human Chromosomes: A Manual of 
Basic Techniques, Pergamon Press, New York City), flow- 
sorted chromosomal preparations, or artificial chromosome 
constructions such as yeast artificial chromosomes (YACs), 
bacterial artificial chromosomes (BACs), bacterial P1
constructions or single chromosome cDNA libraries.

In situ hybridization of chromosomal preparations and 
physical mapping techniques such as linkage analysis using 
established chromosomal markers are invaluable in extend- 
ing genetic maps. Examples of genetic maps can be found in 
the 1994 Genome Issue of Science (265:1981f). Often the 
placement of a gene on the chromosome of another mam- 
malian species may reveal associated markers even if the 
number or arm of a particular human chromosome is not 
known. New partial nucleotide sequences can be assigned to 
chromosomal arms, or parts thereof, by physical mapping. 
This provides valuable information to investigators search- 
ing for disease genes using positional cloning or other gene 
discovery techniques. Once a disease or syndrome, such as
ataxia telangiectasia (AT), has been crudely localized by
genetic linkage to a particular genomic region, for example,
AT to 11q22-23 (Gatti et al (1988) Nature 336:577-580),
any sequences mapping to that area may represent genes for 
further investigation. The nucleotide sequences of the sub- 
ject invention may also be used to detect differences in the 
chromosomal location of nucleotide sequences due to 
translocation, inversion, etc. between normal and carrier or 
affected individuals. 

The partial nucleotide sequence encoding a particular 
kinase may be used to produce an amino acid sequence using 
well known methods of recombinant DNA technology. 
Goeddel (1990, Gene Expression Technology, Methods and
Enzymology, Vol 185, Academic Press, San Diego, Calif.) is
one among many publications which teach expression of an
isolated, purified nucleotide sequence. The amino acid or
peptide may be expressed in a variety of host cells, either
prokaryotic or eukaryotic. Host cells may be from the same
species from which the nucleotide sequence was derived or 
from a different species. Advantages of producing an amino 
acid sequence or peptide by recombinant DNA technology 
include obtaining adequate amounts for purification and the 
availability of simplified purification procedures. 

Cells transformed with a kinase nucleotide sequence may 
be cultured under conditions suitable for the expression and 
recovery of peptide from cell culture. The peptide produced 
by a recombinant cell may be secreted or may be contained 
intracellularly depending on the sequence itself and/or the 
vector used. In general, it is more convenient to prepare
recombinant proteins in secreted form, and this is accomplished
by ligating kin to a recombinant nucleotide sequence
which directs its movement through a particular prokaryotic 
or eukaryotic cell membrane. Other recombinant construc- 
tions may join kin to nucleotide sequence encoding a 
polypeptide domain which will facilitate protein purification 
(Kroll D. J. et al (1993) DNA Cell Biol 12:441-53). 

Direct peptide synthesis using solid-phase techniques
(Stewart et al (1969) Solid-Phase Peptide Synthesis, WH
Freeman Co, San Francisco, Calif.; Merrifield J. (1963) J
Am Chem Soc 85:2149-2154) is an alternative to recombinant
or chimeric peptide production. Automated synthesis
may be achieved, for example, using Applied Biosystems
431A Peptide Synthesizer in accordance with the instructions
provided by the manufacturer. Additionally a particular
kinase sequence or any part thereof may be mutated during
direct synthesis and combined using chemical methods with
other kinase sequence(s) or a part thereof. This chimeric
nucleotide sequence can also be placed in an appropriate
vector and host cell to produce a variant peptide.

Although an amino acid sequence or oligopeptide used for
antibody induction does not require biological activity, it
must be immunogenic. KIN used to induce specific antibodies
may have an amino acid sequence consisting of at
least five amino acids and preferably at least 10 amino acids.
Short stretches of amino acid sequence may be fused with
those of another protein such as keyhole limpet hemocyanin,
and the chimeric peptide used for antibody production.
Alternatively, the oligopeptide may be of sufficient length to
contain an entire domain.

Antibodies specific for KIN may be produced by inoculation
of an appropriate animal with an antigenic fragment of
the peptide. An antibody is specific for KIN if it is produced
against an epitope of the polypeptide and binds to at least
part of the natural or recombinant protein. Antibody
production includes not only the stimulation of an immune
response by injection into animals, but also analogous
processes such as the production of synthetic antibodies, the
screening of recombinant immunoglobulin libraries for
specific-binding molecules (Orlandi R. et al (1989) PNAS
86:3833-3837, or Huse W. D. et al (1989) Science
256:1275-1281), or the in vitro stimulation of lymphocyte
populations. Current technology (Winter G. and Milstein C.
(1991) Nature 349:293-299) provides for a number of
highly specific binding reagents based on the principles of
antibody formation. These techniques may be adapted to
produce molecules which specifically bind kinase peptides.
Antibodies or other appropriate molecules generated against
a specific immunogenic peptide fragment or oligopeptide
can be used in Western analysis, enzyme-linked immunosorbent
assays (ELISA) or similar tests to establish the presence
of or to quantitate amounts of kinase active in normal,
diseased, or therapeutically treated cells or tissues.

[...] invention. These examples are provided by way of
illustration and are not included for the purpose of limiting the
invention.

EXAMPLES
I cDNA Library Construction

The kinase sequences of this application (Table 1) were
first identified among the sequences comprising various
libraries. Technology has advanced considerably since the
first cDNA libraries were made. Many small variations in
both chemicals and machinery have been instituted over
time, and these have improved both the efficiency and safety
of the process. Although the cDNAs could be obtained using
an older procedure, the procedure presented in this application
is exemplary of one currently being used by persons
skilled in the art. For the purpose of providing an exemplary
method, the tissue preparation, mRNA isolation and cDNA
library construction described here is for the rheumatoid
synovium library from which the Incyte Clones 191283 and
192268 for ser/thr kinases were obtained.

Rheumatoid synovial tissue was obtained from the hip
joint removed from a 68 year old female with erosive,
nodular rheumatoid arthritis. The tissue was frozen, ground
to powder in a mortar and pestle, and lysed immediately in
buffer containing guanidinium isothiocyanate. The lysate
was centrifuged over a CsCl cushion (18 hrs at 25,000 rpm
using a Beckman SW28 rotor and ultracentrifuge; Beckman
Instruments, Palo Alto, Calif.), ethanol precipitated,
resuspended in water and DNase treated for 15 min at 37° C. The
RNA was extracted with phenol chloroform and precipitated
with ethanol. Polyadenylated messages were isolated using
Qiagen Oligotex (QIAGEN Inc, Chatsworth, Calif.), and a
custom cDNA library was constructed by Stratagene (La
Jolla, Calif.).

First strand cDNA synthesis was accomplished using an
oligo (dT) primer/linker which also contained an XhoI
restriction site. Second strand synthesis was performed using
a combination of DNA polymerase I, E. coli ligase and
RNase H, followed by the addition of an EcoRI linker to the
blunt ended cDNA. The EcoRI linked, double-stranded
cDNA was then digested with XhoI restriction enzyme,
extracted with phenol chloroform, and fractionated by size
on Sephacryl S400. DNA of the appropriate size was then
ligated to dephosphorylated Lambda Zap® arms
(Stratagene) and packaged using Gigapack extracts
(Stratagene). pBluescript (Stratagene) phagemid DNAs
were excised en masse from the library.

In the alternative, DNAs were purified using Miniprep
Kits (Catalog #77468; Advanced Genetic Technologies
Corporation, Gaithersburg, Md.). These kits provide a
96-well format and enough reagents for 960 purifications.
The recommended protocol supplied with each kit has been
employed except for the following changes. First, the 96
wells [...] Terrific Broth
(LIFE TECHNOLOGIES™, Gaithersburg, Md.) with
carbenicillin at 25 mg/L (2xCarb) and glycerol at 0.4%. After the
wells are inoculated, the bacteria are cultured for 24 hours
and lysed with 60 μl of lysis buffer. A centrifugation step
(2900 rpm for 5 minutes) is performed before the contents
of the block are added to the primary filter plate. The
optional step of adding isopropanol to TRIS buffer is not
routinely performed. After the last step in the protocol,
samples are transferred to a Beckman 96-well block for
storage.

II Sequencing of cDNA Clones
The cDNA inserts from random isolates of the rheumatoid
synovium or other appropriate library were sequenced in
part. Methods for DNA sequencing are well known in the art
and employ such enzymes as the Klenow fragment of DNA
polymerase I, SEQUENASE® (US Biochemical Corp) or
Taq polymerase. Methods to extend the DNA from an
oligonucleotide primer annealed to the DNA template of
interest have been developed for both single- and
double-stranded templates. Chain termination reaction products
were separated using electrophoresis and detected via their
incorporated, labelled precursors. Recent improvements in
mechanized reaction preparation, sequencing and analysis
have permitted expansion in the number of sequences that
can be determined per day. Preferably, the process is
automated with machines such as the Hamilton Micro Lab 2200
(Hamilton, Reno, Nev.), Peltier Thermal Cycler (PTC200;
MJ Research, Watertown Mass.) and the Applied Biosys- 
tems Catalyst 800 and 377 and 373 DNA sequencers. 

The quality of any particular cDNA library may be
determined by performing a pilot scale analysis of 192
cDNAs and checking for percentages of clones containing
vector, lambda or E. coli DNA, mitochondrial or repetitive
DNA, and clones with exact or homologous matches to
public databases. The number of unique sequences (those
having no known match in any available database) was
recorded.

III Homology Searching of cDNA Clones and Their
Deduced Proteins 

Each sequence so obtained was compared to sequences in
GenBank using a search algorithm developed by Applied
Biosystems and incorporated into the INHERIT™ 670
Sequence Analysis System. In this algorithm, Pattern
Specification Language (TRW Inc, Los Angeles, Calif.) was used
to determine regions of homology. The three parameters that
determine how the sequence comparisons run were window
size, window offset, and error tolerance. Using a combination
of these three parameters, the DNA database was
searched for sequences containing regions of homology to
the query sequence, and the appropriate sequences were
scored with an initial value. Subsequently, these homologous
regions were examined using dot matrix homology
plots to distinguish regions of homology from chance
matches. Smith-Waterman alignments were used to display
the results of the homology search.
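
For readers unfamiliar with the Smith-Waterman alignments mentioned above, the following is a minimal sketch of the textbook algorithm with a linear gap penalty. It is not the INHERIT implementation, whose details are not given here; the scoring values are illustrative assumptions.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Textbook dynamic programming with a linear gap penalty. The scoring
    parameters are illustrative defaults, not values from the source.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

With these defaults, aligning `TTTACGTAAA` against `GGACGTCC` recovers the shared `ACGT` region, while two sequences with no bases in common return a score of zero and empty alignment strings.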

Peptide and protein sequence homologies were ascertained
using the INHERIT™ 670 Sequence Analysis System
in a way similar to that used in DNA sequence homologies.
Pattern Specification Language and parameter windows
were used to search protein databases for sequences
containing regions of homology which were scored with an
initial value. Dot-matrix homology plots were examined to
distinguish regions of significant homology from chance
matches.

Alternatively, BLAST, which stands for Basic Local 
Alignment Search Tool, is used to search for local sequence 
alignments (Altschul S. F. (1993) J Mol Evol 36:290-300; 
Altschul, S. F. et al (1990) J Mol Biol 215:403-10). BLAST 
produces alignments of both nucleotide and amino acid 
sequences to determine sequence similarity. Because of the 
local nature of the alignments, BLAST is especially useful 
in determining exact matches or in identifying homologs. 
While it is useful for matches which do not contain gaps, it 
is inappropriate for performing motif-style searching. The 
fundamental unit of BLAST algorithm output is the 
High-scoring Segment Pair (HSP). 

An HSP consists of two sequence fragments of arbitrary 
but equal lengths whose alignment is locally maximal and 
for which the alignment score meets or exceeds a threshold 
or cutoff score set by the user. The BLAST approach is 
to look for HSPs between a query sequence and a database 
sequence, to evaluate the statistical significance of any 
matches found, and to report only those matches which 
satisfy the user-selected threshold of significance. The 
parameter E establishes the statistically significant threshold 
for reporting database sequence matches. E is interpreted as 
the upper bound of the expected frequency of chance 
occurrence of an HSP (or set of HSPs) within the context of 
the entire database search. Any database sequence whose 
match satisfies E is reported in the program output. 
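
How an E threshold filters HSPs can be sketched as follows. The text gives no formula, so this sketch assumes the standard Karlin-Altschul estimate E = K·m·n·exp(-λS), where S is the HSP score, m the query length and n the database length. The values of K and λ, the function names, and the example hits are placeholders for illustration and are not taken from this document.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam=0.318, k=0.13):
    """Karlin-Altschul estimate of chance HSPs scoring at least `score`.

    lam and k depend on the scoring system; the defaults here are
    illustrative placeholders only.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def reportable_hsps(hsps, query_len, db_len, e_threshold=10.0):
    """Keep (name, score) pairs whose estimated E is within the threshold."""
    return [
        (name, score)
        for name, score in hsps
        if expected_chance_hits(score, query_len, db_len) <= e_threshold
    ]


# Hypothetical HSPs for a 100-base query against a 1,000,000-base database.
hits = [("hit_a", 50), ("hit_b", 20)]
print(reportable_hsps(hits, query_len=100, db_len=1_000_000))
```

Because E falls exponentially with score, a modest increase in HSP score moves a match from "expected by chance many times" to "unlikely by chance", which is why only the higher-scoring hypothetical hit survives the filter.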

All the kinase molecules presented in this application 
were examined using INHERIT. Although their identification 
was based on the criteria above, their homology to 
known kinase molecules and name are subject to change 
when additional computer analysis against additional or 
more recent database information is employed. For example, 
whereas the first two kinases in Table 1 were initially 
identified as unique Incyte clones, homologous mouse and 
human kinases are now known. In other cases, additional 
sequence information has become available and its review 
against the known databases has precipitated a name change. 
Occasionally a clone number will also disappear from the 
LIFESEQ™ database (Incyte Pharmaceuticals Inc, Palo 
Alto, Calif.). This situation generally arises during the 
regular review of clones and assembly of contiguous 
sequences. 

IV Extension of cDNAs to Full Length 

The kinase sequences presented here can be used to 
design oligonucleotide primers for the extension of the 
cDNAs to full length. In fact, the partial map kinase cDNA 
sequence (SEQ ID NO 38) initially identified in Incyte clone 
214915 among the sequences comprising the human stom- 
ach cell library was extended to full length as shown in "A 
Novel Human Map Kinase Homolog" by Hawkins et al. 
Incyte Docket PF-036P, filed on Jun. 28, 1995, incorporated 
herein by reference. The coding region of this full length 
sequence (SEQ ID NO 45; Incyte Clone 214915E) begins at 
nucleotide 58 and ends at nucleotide 1156. 

Primers are designed based on known sequence; one 
primer is synthesized to initiate extension in the antisense 
direction (XLR) and the other to extend sequence in the 
sense direction (XLF). The primers allow the sequence to be 
extended "outward" generating amplicons containing new, 
unknown nucleotide sequence for the gene of interest. The 
primers may be designed using Oligo 4.0 (National Bio- 
sciences Inc, Plymouth, Minn.), or another appropriate 
program, to be 22-30 nucleotides in length, to have a GC 
content of 50% or more, and to anneal to the target sequence 
at temperatures of about 68°-72° C. Any stretch of nucleotides 
which would result in hairpin structures or primer-primer 
dimerizations is avoided. 
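
The length and GC-content rules can be checked with a short Python sketch, applied here to the XLR and XLF primers given in this example for clone 214915. The function names are hypothetical, and the sketch does not estimate annealing temperature or screen for hairpins and primer dimers, which a program such as Oligo 4.0 would also handle.

```python
def gc_fraction(primer):
    """Fraction of G and C bases, ignoring the spaces used for readability."""
    seq = primer.replace(" ", "").upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def meets_design_rules(primer, min_len=22, max_len=30, min_gc=0.50):
    """Check only the length (22-30 nt) and GC content (50% or more) rules."""
    seq = primer.replace(" ", "")
    return min_len <= len(seq) <= max_len and gc_fraction(seq) >= min_gc


XLR = "AAG ACA TCC AGG AGC CCA ATG AC"
XLF = "AGG TGA TCC TCA GCT GGA TGC AC"

for name, primer in (("XLR", XLR), ("XLF", XLF)):
    length = len(primer.replace(" ", ""))
    print(name, length, round(gc_fraction(primer), 3), meets_design_rules(primer))
```

Both primers are 23 nucleotides long, with 12 and 13 G or C bases respectively, so each satisfies the two rules checked.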

The stomach cDNA library was used as a template, and 
XLR-AAG ACA TCC AGG AGC CCA ATG AC and 
XLF-AGG TGA TCC TCA GCT GGA TGC AC primers 
were used to extend and amplify the 214915 sequence. By 
following the instructions for the XL-PCR kit and thor- 
oughly mixing the enzyme and reaction mix, high fidelity 
amplification is obtained. Beginning with 25 pMol of each 
primer and the recommended concentrations of all other 
components of the kit, PCR is performed using the Peltier 
Thermal Cycler (PTC200; MJ Research, Watertown, Mass.) 
and the following parameters: 

Step 1 94° C. for 60 sec (initial denaturation) 
Step 2 94° C. for 15 sec 
Step 3 65° C. for 1 min 
Step 4 68° C. for 7 min 
Step 5 Repeat steps 2-4 for 15 additional cycles 
Step 6 94° C. for 15 sec 
Step 7 65° C. for 1 min 
Step 8 68° C. for 7 min+15 sec/cycle 
Step 9 Repeat steps 6-8 for 11 additional cycles 
Step 10 72° C. for 8 min 
Step 11 4° C. (and holding) 
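
The program above amounts to 28 amplification cycles: steps 2-4 run once plus 15 repeats, and steps 6-8 run once plus 11 repeats. The Python sketch below expands the two blocks and lists the extension time of each cycle. It assumes that the 15 sec increment in step 8 is first applied on the second cycle of that block; the text does not say whether the first cycle is also lengthened, and the function name is hypothetical.

```python
def extension_times_seconds():
    """Extension time (steps 4 and 8) for every amplification cycle, in order."""
    times = []
    # Steps 2-4: one pass plus 15 additional cycles, 7 min extension each.
    times += [7 * 60] * (1 + 15)
    # Steps 6-8: one pass plus 11 additional cycles. Assumption: the
    # 15 sec increment starts with the second cycle of this block.
    times += [7 * 60 + 15 * n for n in range(1 + 11)]
    return times


times = extension_times_seconds()
print(len(times), "cycles; final extension", times[-1], "sec")
```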

At the end of 28 cycles, 50 μl of the reaction mix was 
removed; and the remaining reaction mix was run for an 
additional 10 cycles as outlined below: 

Step 1 94° C. for 15 sec 
Step 2 65° C. for 1 min 
Step 3 68° C. for 10 min+15 sec/cycle 
Step 4 Repeat steps 1-3 for 9 additional cycles 
Step 5 72° C. for 10 min 

A 5-10 μl aliquot of the reaction mixture is analyzed by 
electrophoresis on a low concentration (about 0.6-0.8%) 
agarose mini-gel to determine which reactions were 
successful in extending the sequence. Although all extensions 
potentially contain a full length gene, some of the largest 
products or bands are selected and cut out of the gel. Further 
purification involves using a commercial gel extraction 
method such as QIAQuick™ (QIAGEN Inc). After recovery 
of the DNA, Klenow enzyme is used to trim single-stranded, 
nucleotide overhangs creating blunt ends which facilitate 
religation and cloning. 

After ethanol precipitation, the products are redissolved in 
13 μl of ligation buffer. Then, 1 μl T4-DNA ligase (15 units) 
and 1 μl T4 polynucleotide kinase are added, and the mixture 
is incubated at room temperature for 2-3 hours or overnight 
at 16° C. Competent E. coli cells (in 40 μl of appropriate 
media) are transformed with 3 μl of ligation mixture and 
cultured in 80 μl of SOC medium (Sambrook J. et al, supra). 
After incubation for one hour at 37° C., the whole 
transformation mixture is plated on Luria Bertani (LB)-agar 
(Sambrook J. et al, supra) containing 2xCarb. The following 
day, 12 colonies are randomly picked from each plate and 
cultured in 150 μl of liquid LB/2xCarb medium placed in an 
individual well of an appropriate, commercially-available, 
sterile 96-well microtiter plate. The following day, 5 μl of 
each overnight culture is transferred into a non-sterile 
96-well plate and after dilution 1:10 with water, 5 μl of each 
sample is transferred into a PCR array. 

For PCR amplification, 15 μl of concentrated PCR reaction 
mix (1.33x) containing 0.75 units of Taq polymerase, a 
vector primer and one or both of the gene specific primers 
used for the extension reaction is added to each well. 
Amplification is performed using the following conditions: 

Step 1 94° C. for 60 sec 
Step 2 94° C. for 20 sec 
Step 3 55° C. for 30 sec 
Step 4 72° C. for 90 sec 
Step 5 Repeat steps 2-4 for an additional 29 cycles 
Step 6 72° C. for 180 sec 
Step 7 4° C. (and holding) 
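
The 1.33x concentration of the reaction mix is consistent with the volumes given for this colony PCR. Assuming the 5 μl diluted culture sample is the only other component in each well, adding 15 μl of mix gives a 20 μl reaction at very nearly 1x. The short Python calculation below makes the arithmetic explicit; the function name is hypothetical.

```python
def final_concentration(stock_x, stock_ul, sample_ul):
    """Concentration of the mix after combining it with the sample."""
    return stock_x * stock_ul / (stock_ul + sample_ul)


# 15 ul of 1.33x mix plus 5 ul of diluted culture (assumed total: 20 ul).
print(round(final_concentration(1.33, 15, 5), 4))
```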

Aliquots of the PCR reactions are run on agarose gels 
together with molecular weight markers. The sizes of the 
PCR products are compared to the original partial cDNAs, 
and appropriate clones are selected, ligated into plasmid and 
sequenced. 

V Diagnostic Assays Using Kinase Specific Oligomers 

In those cases where a specific disorder or disease (see 
definitions supra) is suspected to involve altered quantities 
of a particular kinase, oligomers may be designed to estab- 
lish the presence and/or quantity of mRNA expressed in a 
biological sample. There are several methods currently 
being used to quantitate the expression of a particular 
molecule. Most of these methods use radiolabelled (Melby 
P. C. et al 1993 J Immunol Methods 159:235-44) or bioti- 
nylated (Duplaa C. et al 1993 Anal Biochem 229-36) 
nucleotides, coamplification of a control nucleic acid, and 
standard curves onto which the experimental results are 
interpolated. For example, phosphorylase B kinase defi- 
ciency may manifest as hepatomegaly which is inherited as 
either an X-linked or autosomal recessive trait or myoglo- 
binuria whose inheritance is unknown. 
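
Interpolation of an experimental result onto a standard curve can be sketched in Python as below. This is an illustration only: the standards, the signal values, and the choice to interpolate linearly in log10 of quantity between the two bracketing standards are assumptions, not details taken from the cited methods.

```python
import math


def interpolate_quantity(signal, standards):
    """Estimate a quantity from its signal using a standard curve.

    standards -- list of (known_quantity, measured_signal) pairs with
                 distinct signals. Interpolation is linear in
                 log10(quantity) between the two bracketing standards.
    """
    points = sorted(standards, key=lambda p: p[1])
    for (q_lo, s_lo), (q_hi, s_hi) in zip(points, points[1:]):
        if s_lo <= signal <= s_hi:
            frac = (signal - s_lo) / (s_hi - s_lo)
            log_q = math.log10(q_lo) + frac * (math.log10(q_hi) - math.log10(q_lo))
            return 10 ** log_q
    raise ValueError("signal lies outside the standard curve")


# Hypothetical standards: (template copies, measured signal).
curve = [(10, 1.0), (100, 2.0), (1000, 3.0)]
print(interpolate_quantity(2.5, curve))
```

A result outside the range of the standards raises an error rather than being extrapolated, since the curve says nothing reliable beyond its end points.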

Oligomers for phosphorylase B kinase are first used in 
quantitative PCR to establish a normal range for expression 
of phosphorylase B kinase. Then, these same oligomers are 
used with extracts of cells from patients with inherited 
phosphorylase B kinase deficiency. The information from 
such studies is used to define different inheritance patterns 
and to diagnose future patients displaying phosphorylase B 
kinase deficiency-like symptoms. In like manner, this same 
assay can be used to monitor progress of the patient as 
his/her physiological situation moves toward the normal 
range during therapy for the condition. 

VI Kinases Kit 

The kinases of the subject invention are used to produce 
a kinases kit for diagnosing disorders or diseases associated 
with altered kinase expression. This involves designing 
a plurality of oligomers, one set of which is specific for each 
kinase or kinase regulatory sequence. Specificity in this case 
refers to sequence similarity, to the length of the nucleic acid 
molecule amplified, to cell or tissue type being screened or 
to the disorder or disease. These oligomers are combined 
with a biological sample obtained from a patient in a 
solution sufficient for PCR and amplified. The PCR products 
are examined first, to detect the expression of each kinase, 
and second to quantify the expression of each kinase. Kinase 
expression is compared with standard ranges for normal and 
abnormal expression. In the case(s) where kinase expression 
is altered, use of the kit has provided the physician with a 
named disorder or disease which can be treated or further 
investigated. 

A further use of the oligomers from the kinases kit is in 
a diagnostic assay of example V (above) used to monitor 
patient response to drug therapy. Once the disease has been 
named and a therapy chosen, the oligomers specific to the 
patient's disease may be used periodically to monitor the 
efficacy of the chosen therapy. In this case, the specific 
oligomers are combined with a biological sample from the 
patient in a solution sufficient for PCR and amplified. The 
PCR product is quantified and compared with a normal 
standard and with the pretreatment profile of the patient. If 
the kinase expression is tending toward normal, the therapy 
may be considered effective; if the expression is even more 
abnormal, therapy should be discontinued and an alternative 
treatment instituted. 
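
The comparison against the normal standard and the pretreatment profile can be expressed as a small Python sketch. The function name, the use of distance from a normal range as the measure of abnormality, and the example numbers are all hypothetical; the text does not specify how "tending toward normal" is quantified.

```python
def assess_therapy(pretreatment, current, normal_low, normal_high):
    """Compare current kinase expression with the pretreatment value."""

    def distance_from_normal(value):
        # Zero inside the normal range, otherwise the gap to the nearest bound.
        if value < normal_low:
            return normal_low - value
        if value > normal_high:
            return value - normal_high
        return 0.0

    before = distance_from_normal(pretreatment)
    now = distance_from_normal(current)
    if now < before:
        return "tending toward normal"
    if now > before:
        return "more abnormal"
    return "unchanged"


# Hypothetical expression values in arbitrary units, normal range 80-120.
print(assess_therapy(pretreatment=200, current=150, normal_low=80, normal_high=120))
```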

VII Sense or Antisense Molecules 

Knowledge of the correct cDNA sequence of any particular 
kinase, its regulatory elements or parts thereof will 
enable its use as a tool in sense (Youssoufian H. and H. F. 
Lodish (1993) Mol Cell Biol 13:98-104) or antisense 
(Eguchi et al (1991) Annu Rev Biochem 60:631-652) 
technologies for the investigation of gene function. 
Oligonucleotides, from genomic or cDNAs, comprising 
either the sense or the antisense strand of the cDNA 
sequence can be used in vitro or in vivo to inhibit expression. 
Such technology is now well known in the art, and 
oligonucleotides or other fragments can be designed from various 
locations along the sequences. 

The gene of interest can be turned off in the short term by 
transfecting a cell or tissue with expression vectors which 
will flood the cell with sense or antisense sequences until all 
copies of the vector are disabled by endogenous nucleases. 
Stable transfection of appropriate germ line cells or preferably 
a zygote with a vector containing the fragment will 
produce a transgenic organism (U.S. Pat. No. 4,736,866, 12 
Apr. 1988), which produces enough copies of the sense or 
antisense sequence to significantly compromise or entirely 
eliminate normal activity of the particular kinase gene. 
Frequently, the function of the gene can be ascertained by 
observing behaviors such as lethality, loss of a physiological 
pathway, changes in morphology, etc. at the intracellular, 
cellular, tissue or organismal level. 

In addition to using fragments constructed to interrupt 
transcription of the open reading frame, modifications of 
gene expression can be obtained by designing antisense 
sequences to promoters, enhancers, introns, or even to 
trans-acting regulatory genes. Similarly, inhibition can be 
achieved using Hogeboom base-pairing methodology, also 
known as "triple helix" base pairing. 



VIII Expression of Kinases 

Expression of the kinases may be accomplished by 
subcloning the cDNAs into appropriate vectors and transfecting 
the vectors into host cells. In some cases, the cloning vector 
previously used for the generation of the tissue library also 
provides for direct expression of kinase sequences in E. coli. 
Upstream of the cloning site, this vector contains a promoter 
for β-galactosidase, followed by sequence containing the 
amino-terminal Met and the subsequent 7 residues of 
β-galactosidase. Immediately following these eight residues 
is a bacteriophage promoter useful for transcription and a 
linker containing a number of unique restriction sites. 

Induction of an isolated, transfected bacterial strain with 
IPTG using standard methods will produce a fusion protein 
corresponding to the first seven residues of β-galactosidase, 
about 5 to 15 residues which correspond to linker, and the 
peptide encoded within the kinase cDNA. Since cDNA 
clone inserts are generated by an essentially random process, 
there is one chance in three that the included cDNA will lie 
in the correct frame for proper translation. If the cDNA is not 
in the proper reading frame, it can be obtained by deletion 
or insertion of the appropriate number of bases by well 
known methods including in vitro mutagenesis, digestion 
with exonuclease III or mung bean nuclease, or 
oligonucleotide linker inclusion. 

The kinase cDNA can be shuttled into other vectors 
known to be useful for expression of protein in specific 
hosts. Oligonucleotide linkers containing cloning sites as 
well as a stretch of DNA sufficient to hybridize to the end of 
the target cDNA (25 bases) can be synthesized chemically 
by standard methods. These primers can then be used to 
amplify the desired gene fragments by PCR. The resulting 
fragments can be digested with appropriate restriction 
enzymes under standard conditions and isolated by gel 
electrophoresis. Alternatively, similar gene fragments can be 
produced by digestion of the cDNA with appropriate 
restriction enzymes and filling in the missing gene sequence with 
chemically synthesized oligonucleotides. Partial nucleotide 
sequence from more than one gene can be ligated together 
and cloned in appropriate vectors to optimize expression. 

Suitable expression hosts for such chimeric molecules 
include but are not limited to mammalian cells such as 
Chinese Hamster Ovary (CHO) and human 293 cells, insect 
cells such as Sf9 cells, yeast cells such as Saccharomyces 
cerevisiae, and bacteria such as E. coli. For each of these cell 
systems, a useful expression vector may also include an 
origin of replication to allow propagation in bacteria and a 
selectable marker such as the β-lactamase antibiotic 
resistance gene to allow selection in bacteria. In addition, the 
vectors may include a second selectable marker such as the 
neomycin phosphotransferase gene to allow selection in 
transfected eukaryotic host cells. Vectors for use in 
eukaryotic expression hosts may require RNA processing elements 
such as 3' polyadenylation sequences if such are not part of 
the cDNA of interest. 

Additionally, some of the kinase vectors may contain 
native promoters which will allow induction of gene 
expression in human cells such as the 293 line mentioned above. 
Other available promoters are host specific and may be 
specifically combined with the coding region of the kinase 
of interest. They include MMTV, SV40, and metallothionine 
promoters for CHO cells; trp, lac, tac and T7 promoters for 
bacterial hosts; and alpha factor, alcohol oxidase and PGH 
promoters for yeast. In addition, transcription enhancers, 
such as the Rous sarcoma virus (RSV) enhancer, may be used 
in mammalian host cells. Once homogeneous cultures of 
recombinant cells are obtained through standard culture 
methods, large quantities of recombinantly produced peptide 
can be recovered from the conditioned medium and 
analyzed using methods known in the art. 

IX Isolation of Recombinant KIN 

KIN may be expressed as a recombinant protein with one 
or more additional polypeptide domains added to facilitate 
protein purification. Such purification facilitating domains 
include, but are not limited to, metal chelating peptides such 
as histidine-tryptophan modules that allow purification on 
immobilized metals, protein A domains that allow 
purification on immobilized immunoglobulin, and the domain 
utilized in the FLAGS extension/affinity purification system 
(Immunex Corp, Seattle, Wash.). The inclusion of a 
cleavable linker sequence such as Factor XA or enterokinase 
(Invitrogen) between the purification domain and the kin 
sequence may be useful to facilitate expression of KIN. 

X Testing for Kinase Activity 

The sequences in this application represent many different 
domains of different kinase families. These domains (and 
subdomains as detailed in the background of the invention) 
may be utilized: 1) individually for the production of 
antibodies, 2) in functional groups (eg. to span a membrane), 
and 3) as interchangeable, usable parts of a chimeric kinase. 
The various partial cDNA sequences of this application 
represent the different kinase domains of the various 
families (Hardie G. and Hanks S., supra), and they may be 
recombined in numerous ways to produce chimeric nucleic 
acid molecules. For example, a known, full length kinase 
such as the human map kinase of this application (Seq ID No 
45) may be used to swap related portions of the nucleic acid 
sequence, analogous to domains or subdomains of MAP 
kinase polypeptides. The chimeric nucleotides, so produced, 
may be introduced into prokaryotic host cells (as reviewed 
in Strosberg A. D. and Marullo S. (1992) Trends PharmaSci 
13:95-98) or eukaryotic host cells. These host cells are then 
employed in procedures to determine what molecules 
activate the kinase or what molecules are activated by a kinase. 
Such activating or activated molecules may be of 
extracellular, intracellular, biologic or chemical origin. 

An example of a test system, in this case for protein 
tyrosine kinases, can be based on the interaction of protein 
tyrosine kinases with chemokine receptors (Taniguchi T. 
(1995) Science 268:251-255). These receptors are capable 
of activating a variety of nonreceptor protein tyrosine 
kinases when stimulated by an extracellular chemokine. 
C-X-C chemokines such as platelet factor 4, interleukin-8, 
connective tissue activating protein III, neutrophil activating 
peptide 2, are soluble activators of neutrophils. 

A standard measure of neutrophil activation involves 
measuring the mobilization of Ca++ as part of the signal 
transduction pathway. The experiment involves several 
steps. First, blood cells obtained from venipuncture are 
fractionated by centrifugation on density gradients. Enriched 
populations of neutrophils are further fractionated on 
columns by negative selection using antibodies specific for 
other blood cell types. Next, neutrophils are transformed 
with an expression vector containing the kinase nucleic acid 
sequence of interest and preloaded fluorescent probe whose 
emission characteristics have been altered by Ca++ binding. 
Or in the alternative, the neutrophil is preloaded with the 
purified kinase of interest and fluorescent probe. Then, when 
the cells are exposed to an appropriate chemokine, the 
chemokine receptor activates the kinase which, in turn, 
initiates Ca++ flux. Ca++ mobilization is observed and 
measured using fluorometry as has been described in 
Grynkiewicz G. et al (1985) J Biol Chem 260:3440, and McColl S. 
et al (1993) J Immunol 150:4550-4555, incorporated herein 
by reference. 

XI Identification of or Production of Kinase Specific Anti- 
bodies 

Purified KIN is used to screen a pre-existing antibody 
library or to raise antibodies, using either polyclonal or 
monoclonal methodology. For polyclonal antibody 
production, denatured peptide from the reverse phase HPLC 
separation is obtained in quantities up to 75 mg. This 
denatured protein can be used to immunize mice or rabbits 
using standard protocols; about 100 micrograms are 
adequate for immunization of a mouse, while up to 1 mg 
might be used to immunize a rabbit. In identifying mouse 
hybridomas, the denatured protein can be labelled and used 
to screen potential murine B-cell hybridomas for those 
which produce antibody. This procedure requires only small 
quantities of protein, such that 20 mg would be sufficient for 
labelling and screening of several thousand clones. 

For monoclonal antibody production, the amino acid 
sequence, as deduced from translation of the cDNA, is 
analyzed to determine regions of high immunogenicity. 
Peptides comprising appropriate hydrophilic regions are 
expressed from recombinant cDNA or synthesized and used 
in suitable immunization protocols to raise antibodies. 
Selection of appropriate epitopes is described by Ausubel F. 
M. et al (supra). The optimal amino acid sequences for 
immunization are usually located at the C-terminus or 
N-terminus and in intervening, hydrophilic regions of the 
polypeptide which are likely to be exposed to the external 
environment when the protein is in its natural conformation. 
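
One common way to look for hydrophilic stretches in a deduced amino acid sequence is a sliding-window hydropathy average. The Python sketch below uses the Kyte-Doolittle hydropathy scale, which is a substitution made here for illustration; the text does not name a scale or method. The 15-residue window matches the peptide length used in this example, while the cutoff of -1.0 and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (negative = hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophilic_windows(protein, window=15, max_mean=-1.0):
    """Return (start, peptide, mean hydropathy) for hydrophilic windows.

    max_mean is an illustrative cutoff, not a value from the source.
    """
    hits = []
    for i in range(len(protein) - window + 1):
        peptide = protein[i:i + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in peptide) / window
        if mean <= max_mean:
            hits.append((i, peptide, round(mean, 2)))
    return hits


# Hypothetical sequence: a hydrophobic stretch followed by a charged one.
for start, peptide, mean in hydrophilic_windows("L" * 15 + "K" * 15):
    print(start, peptide, mean)
```

A hydropathy average identifies candidate surface-exposed regions only; it does not by itself predict immunogenicity, so the selected peptides would still be evaluated as described in the cited protocols.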

Typically, selected oligopeptides, about 15 residues in 
length, are synthesized using an Applied Biosystems Peptide 
Synthesizer Model 431A using fmoc-chemistry and coupled 
to keyhole limpet hemocyanin (KLH, Sigma) by reaction 
with M-maleimidobenzoyl-N-hydroxysuccinimide ester 
(MBS; Ausubel F. M. et al, supra). If necessary, a cysteine 
may be introduced at the N-terminus of the peptide to permit 
coupling to KLH. Rabbits are immunized with the peptide- 
KLH complex in complete Freund's adjuvant. The resulting 
antisera are tested for antipeptide activity by binding the 
peptide to plastic, blocking with 1% bovine serum albumin, 
reacting with antisera, washing and reacting with labelled, 
affinity purified, specific goat anti-rabbit IgG. 

Hybridomas may also be prepared and screened using 
standard techniques. Hybridomas of interest are detected by 
screening with labelled KIN to identify those fusions pro- 
ducing the monoclonal antibody with the desired specificity. 
In a typical protocol, wells of plates (FAST; Becton- 
Dickinson, Palo Alto, Calif.) are coated during incubation 
with affinity purified, specific rabbit anti-mouse (or suitable 
anti-species Ig) antibodies at 10 mg/ml. The coated wells are 
blocked with 1% BSA, washed and incubated with super- 
natants from hybridomas. After washing the wells are incu- 
bated with labelled KIN at 1 mg/ml. Supernatants with 
specific antibodies bind more labelled KIN than is detectable 
in the background. Then clones producing specific antibod- 
ies are expanded and subjected to two cycles of cloning at 
limiting dilution. Cloned hybridomas are injected into 
pristane-treated mice to produce ascites, and monoclonal 
antibody is purified from mouse ascitic fluid by affinity 
chromatography on Protein A. Monoclonal antibodies with 
affinities of at least 10^8 M^-1, preferably 10^9 to 10^10 or stronger, 
will typically be made by standard procedures as described 
in Harlow and Lane (1988) Antibodies: A Laboratory 
Manual, Cold Spring Harbor Laboratory, Cold Spring 
Harbor, N.Y.; and in Goding (1986) Monoclonal Antibodies: 
Principles and Practice, Academic Press, New York City, 
both incorporated herein by reference. 

XII Diagnostic Assays Using KIN Specific Antibodies 

Particular KIN antibodies are useful for investigation of 
various disorders or diseases which may be characterized by 
differences in the amount or distribution of KIN. Given the 
usual role of the kinases, KIN might be expected to be 
upregulated (or downregulated) in its involvement in 
activation of signal cascades. 

Diagnostic assays for KIN include methods utilizing the 
antibody and a reporter molecule to detect KIN in human 
body fluids, membranes, cells, tissues or extracts thereof. 
The antibodies of the present invention may be used with or 
without modification. Frequently, the antibodies will be 
labelled by joining them, either covalently or noncovalently, 
with a substance which provides for a detectable signal. A 
wide variety of reporter molecules and conjugation 
techniques are known and have been reported extensively in 
both the scientific and patent literature. Suitable reporter 
molecules or labels include those radionuclides, enzymes, 
fluorescent, chemi-luminescent, or chromogenic agents 
previously mentioned as well as substrates, cofactors, 
inhibitors, magnetic particles and the like. Patents teaching 
the use of such labels include U.S. Pat. Nos. 3,817,837; 
3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and 
4,366,241. Also, recombinant immunoglobulins may be 
produced as shown in U.S. Pat. No. 4,816,567, incorporated 
herein by reference. 

A variety of protocols for measuring soluble or 
membrane-bound KIN, using either polyclonal or 
monoclonal antibodies specific for the protein, are known in the 
art. Examples include enzyme-linked immunosorbent assay 
(ELISA), radioimmunoassay (RIA) and fluorescent 
activated cell sorting (FACS). A two-site monoclonal-based 
immunoassay utilizing monoclonal antibodies reactive to 
two non-interfering epitopes on KIN is preferred, but a 
competitive binding assay may be employed. These assays 
are described, among other places, in Maddox, D. E. et al 
(1983, J Exp Med 158:1211). 

XIII Purification of Native KIN Using Antibodies 

Native or recombinant protein kinases can be purified by 
immunoaffinity chromatography using antibodies specific 
for that particular KIN. In general, an immunoaffinity 
column is constructed by covalently coupling the anti-KIN 
antibody to an activated chromatographic resin. 

Polyclonal immunoglobulins are prepared from immune 
sera either by precipitation with ammonium sulfate or by 
purification on immobilized Protein A (Pharmacia Biotech). 
Likewise, monoclonal antibodies are prepared from mouse 
ascites fluid by ammonium sulfate precipitation or 
chromatography on immobilized Protein A. Partially purified 
immunoglobulin is covalently attached to a chromatographic resin 
such as CnBr-activated Sepharose (Pharmacia Biotech). The 
antibody is coupled to the resin, the resin is blocked, and the 
derivative resin is washed according to the manufacturer's 
instructions. 

Such immunoaffinity columns may be utilized in the 
purification of KIN by preparing a fraction from cells 
containing KIN in a soluble form. This preparation may be 
derived by solubilization of whole cells or of a subcellular 
fraction obtained via differential centrifugation (with or 
without addition of detergent) or by other methods well 
known in the art. Alternatively, soluble KIN containing a 
signal sequence may be secreted in useful quantity into the 
medium in which the cells are grown. 

A soluble KIN-containing preparation is passed over the 
immunoaffinity column, and the column is washed under 
conditions that allow the preferential absorbance of KIN (eg, 
high ionic strength buffers in the presence of detergent). 
Then, the column is eluted under conditions that disrupt 
antibody/KIN binding (eg, a buffer of pH 2-3 or a high 
concentration of a chaotrope such as urea or thiocyanate 
ion), and KIN is collected. 

XIV Drug Screening 

This invention is particularly useful for screening thera- 
peutic compounds by using binding fragments of KIN in any 
of a variety of drug screening techniques. The molecules to 
be screened may be of extracellular, intracellular, biologic or 
chemical origin. The peptide fragment employed in such a 
test may either be free in solution, affixed to a solid support, 
borne on a cell surface or located intracellularly. One may 
measure, for example, the formation of complexes between 
KIN and the agent being tested. Alternatively, one can 
examine the diminution in complex formation between KIN 
and a receptor caused by the agent being tested. 

Methods of screening for drugs or any other agents which 
can affect signal transduction comprise contacting such an 
agent with a KIN fragment and assaying for the presence of a 
complex between the agent and the KIN fragment. In such 
assays, the KIN fragment is typically labelled. After suitable 
incubation, free KIN fragment is separated from that present 
in bound form, and the amount of free or uncomplexed label 
is a measure of the ability of the particular agent to bind to 
KIN. 

Another technique for drug screening provides high 
throughput screening for compounds having suitable 
binding affinity to the KIN polypeptides and is described in detail 
in European Patent Application 84/03564, published on Sep. 
13, 1984, incorporated herein by reference. Briefly stated, 
large numbers of different small peptide test compounds are 
synthesized on a solid substrate, such as plastic pins or some 
other surface. The peptide test compounds are reacted with 
KIN fragment and washed. Bound KIN fragment is then 
detected by methods well known in the art. Purified KIN can 
also be coated directly onto plates for use in the 
aforementioned drug screening techniques. In addition, 
non-neutralizing antibodies can be used to capture the peptide 
and immobilize it on the solid support. 

This invention also contemplates the use of competitive 
drug screening assays in which neutralizing antibodies 
capable of binding KIN specifically compete with a test 
compound for binding to KIN fragments. In this manner, the 
antibodies can be used to detect the presence of any peptide 
which shares one or more antigenic determinants with KIN. 

XV Identification of Molecules Which Interact with KIN
The inventive purified KIN is a research tool for
identification, characterization and purification of
interacting signal transduction pathway proteins. Appropriate
labels are incorporated into KIN by various methods
known in the art and KIN is used to capture soluble or
interact with membrane-bound molecules. A preferred
method involves labeling the primary amino groups in KIN
with 125I Bolton-Hunter reagent (Bolton, A. E. and Hunter,
W. M. (1973) Biochem J 133:529). This reagent has been
used to label various molecules without concomitant loss of
biological activity (Hebert C. A. et al (1991) J Biol Chem
266:18989-94; McColl S. et al (1993) J Immunol
150:4550-4555). Membrane-bound molecules are incubated
with the labelled KIN molecules, washed to remove
unbound molecules, and the KIN complex is quantified.
Data obtained using different concentrations of KIN are used
to calculate values for the number, affinity, and association
of KIN with the signal transduction complex.
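
The specification does not state how these values are calculated. One conventional option is a Scatchard analysis, sketched here under the assumption of a single class of independent binding sites. The function name `scatchard_fit`, the units, and the data points are hypothetical and serve only as an illustration.

```python
from typing import Sequence, Tuple


def scatchard_fit(bound: Sequence[float], free: Sequence[float]) -> Tuple[float, float]:
    """Estimate (Kd, Bmax) by least squares on the Scatchard transform.

    Assumes one class of independent sites, so that
    bound/free = (Bmax - bound) / Kd, i.e. a straight line with
    slope -1/Kd and x-intercept Bmax.
    """
    if len(bound) != len(free) or len(bound) < 2:
        raise ValueError("need at least two paired (bound, free) points")
    x = list(bound)
    y = [b / f for b, f in zip(bound, free)]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    kd = -1.0 / slope            # dissociation constant (affinity)
    bmax = -intercept / slope    # number of binding sites
    return kd, bmax


# Hypothetical, noise-free data generated from Kd = 2.0 nM and Bmax = 100 fmol.
free_nM = [0.5, 1.0, 2.0, 4.0, 8.0]
bound_fmol = [100.0 * f / (2.0 + f) for f in free_nM]
kd, bmax = scatchard_fit(bound_fmol, free_nM)
print(f"Kd = {kd:.2f} nM, Bmax = {bmax:.1f} fmol")
```

With real, noisy counts a nonlinear fit of the binding isotherm is often preferred, because the Scatchard transform distorts the error structure; the linear form is used here only because it is compact.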



Labelled KIN fragments are also useful as a reagent for 
the purification of molecules with which KIN interacts, 
specifically including inhibitors. In one embodiment of 
affinity purification, KIN is covalently coupled to a chro- 
matography column. Cells and their membranes are 
extracted, KIN is removed and various KIN-free subcom- 
ponents are passed over the column. Molecules bind to the 
column by virtue of their KIN affinity. The KIN-complex is 
recovered from the column, dissociated and the recovered 
molecule is subjected to N-terminal protein sequencing. 
This amino acid sequence is then used to identify the 
captured molecule or to design degenerate oligomers for 
cloning its gene from an appropriate cDNA library. 
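
When degenerate oligomers are designed from an N-terminal sequence, a common first step is to find the stretch of residues encoded by the fewest codon combinations. The sketch below illustrates that step using the codon counts of the standard genetic code. The peptide `MKWDEFYLS` and the function names are hypothetical, and nothing here is taken from the specification itself.

```python
from math import prod
from typing import Tuple

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide: str) -> int:
    """Number of distinct nucleotide sequences that could encode the peptide."""
    return prod(CODON_COUNT[residue] for residue in peptide.upper())


def least_degenerate_window(peptide: str, width: int) -> Tuple[int, str, int]:
    """Return (start index, sub-peptide, degeneracy) of the least degenerate window."""
    if not 0 < width <= len(peptide):
        raise ValueError("width must be between 1 and the peptide length")
    candidates = [
        (degeneracy(peptide[i:i + width]), i)
        for i in range(len(peptide) - width + 1)
    ]
    best_degeneracy, start = min(candidates)
    return start, peptide[start:start + width], best_degeneracy


# Hypothetical N-terminal sequence; a real one would come from protein sequencing.
n_terminal = "MKWDEFYLS"
print(least_degenerate_window(n_terminal, 6))
```

Windows rich in Met and Trp (one codon each) and poor in Leu, Ser and Arg (six codons each) give the smallest oligomer pools.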

In an alternate method, monoclonal antibodies raised 
against KIN fragments are screened to identify those which 
inhibit the binding of labelled KIN. These monoclonal 
antibodies are then used in affinity purification or expression 
cloning of associated molecules. Other soluble binding 
molecules are identified in a similar manner. Labelled KIN 
is incubated with extracts or other appropriate materials 
derived from rheumatoid synovium. After incubation, KIN 
complexes (which are larger than the lone KIN fragment) are 
identified by a sizing technique such as size exclusion 
chromatography or density gradient centrifugation and are 
purified by methods known in the art. The soluble binding 
protein(s) are subjected to N-terminal sequencing to obtain 
information sufficient for database identification, if the
soluble protein is known, or for cloning, if the soluble
protein is unknown.

XVI Use and Administration of Antibodies or Other Inhibi- 
tory Molecules 

Antibodies, inhibitors, receptors or antagonists of KIN
fragments (or other treatments to limit signal transduction,
TSTs) can provide different effects when administered
therapeutically. TSTs will be formulated in a nontoxic, inert,
pharmaceutically acceptable aqueous carrier medium pref- 
erably at a pH of about 5 to 8, more preferably 6 to 8, 
although the pH may vary according to the characteristics of 
the antibody, inhibitor, or antagonist being formulated and 
the condition to be treated. Characteristics of TSTs include 
solubility of the molecule, half-life and antigenicity/ 
immunogenicity; these and other characteristics may aid in 
defining an effective carrier. Native human proteins are 
preferred as TSTs, but organic or synthetic molecules result- 
ing from drug screens may be equally effective in particular 
situations. 

TSTs may be delivered by known routes of administration 
including but not limited to topical creams and gels; trans- 
mucosal spray and aerosol; transdermal patch and bandage; 
injectable, intravenous and lavage formulations; and orally 
administered liquids and pills particularly formulated to 
resist stomach acid and enzymes. The particular 
formulation, exact dosage, and route of administration will 
be determined by the attending physician and will vary 
according to each specific situation.

Such determinations are made by considering multiple 
variables such as the condition to be treated, the TST to be 
administered, and the pharmacokinetic profile of the par- 
ticular TST. Additional factors which may be taken into 
account include disease state (e.g. severity) of the patient, 
age, weight, gender, diet, time and frequency of 
administration, drug combination, reaction sensitivities, and 
tolerance/response to therapy. Long acting TST formula- 
tions might be administered every 3 to 4 days, every week, 
or once every two weeks depending on half-life and clear- 
ance rate of the particular TST. 

Normal dosage amounts may vary from 0.1 to 100,000 
micrograms, up to a total dose of about 1 g, depending upon 
the route of administration. Guidance as to particular dos- 
ages and methods of delivery is provided in the literature. 
See U.S. Pat. Nos. 4,657,760; 5,206,344; or 5,225,212. Those
skilled in the art will employ different formulations for
different TSTs. Administration to cells such as nerve cells 
necessitates delivery in a manner different from that to other 
cells such as vascular endothelial cells. 

It is contemplated that disorders or diseases which trigger 
defensive signal transduction may precipitate damage that is 
treatable with TSTs. These disorders or diseases may be 
specifically diagnosed by the tests discussed above, and such 
testing should be performed in cases where physiologic or 
pathologic problems are suspected to be associated with 
abnormal signal transduction. 

All publications and patents mentioned in the above 
specification are herein incorporated by reference. Various
modifications and variations of the described method and
system of the invention will be apparent to those skilled in
the art without departing from the scope and spirit of the
invention. Although the invention has been described in
connection with specific preferred embodiments, it should
be understood that the invention as claimed should not be
unduly limited to such specific embodiments. Indeed, various
modifications of the above-described modes for carrying
out the invention which are obvious to those skilled in the
field of molecular biology or related fields are intended to be
within the scope of the following claims.



TABLE 1

Clone     Library               GenBank/SwissProt Identifier, Name

297       U937                  PO0540 Mouse protooncogene ser/thr kinase
1622      U937                  HUMCLK3B clk3 gene product
10007     THP-1 Phorbol LPS     HSPLK1 protein kinase
12702     THP-1 Phorbol LPS     RATSGPK ser/thr kinase
23789     Inflamed Adenoid      CHKFRNK chicken tyr kinase
35652     HUVEC                 KEK5 Chicken Y kinase receptor
35855     HUVEC                 HUMANBTK37 tyr kinase
40194     T + B Lymphoblast     KRB1 VAKV Variola virus protein kinase
42170     T + B Lymphoblast     H3Uuy_>o4 serine kinase
46081     Corneal Stroma        To^JUi^J yeast protein kinase
46651     Corneal Stroma        CDK4 P11802
53840     Fibroblast            HSDAPK Death-associated protein kinase
54065     Fibroblast            SCPROKIN 1 yeast 35.6 kD
56494     Fibroblast            KLMC RAT, myosin light chain kinase
58029     Skeletal Muscle       ATHCTRIA 1 A. thaliana Y kinase receptor
64663     Placenta              KIN3 Yeast protein kinase P22209
67967     HUVEC Sheer Stress    YAX1 Yeast protein kinase
68963     HUVEC Sheer Stress    KATK Human Y kinase
71904     Placenta              KIN3 P22209SwP
75289     THP-1 Phorbol         H5U0SO23 Avian retrovirus rpl30
81865     Rheumatoid Synovium   SNT1 Yeast C catabolite derepressing
82056     HUVEC Sheer Stress    P34314 C. elegans ser/thr kinase
108485    AML Blast             KAPA Pig cAMP-dependent protein kinase
114973    Testis                CC2B ARATH Mouse-ear cress cdc
118591    Skeletal Muscle       PB0192 mixed lineage kinase 1
119819    Skeletal Muscle       H5U09564 scr kinase
120376    Skeletal Muscle       U01064 Y kinase
132750    Bone Marrow           MLK2 mixed lineage kinase 2
140052    T Lymphocyte          G-protein coupled receptor kinase
146392    T Lymphocyte          SCYAK1 Yeast Yak1 kinase
156108    THP-1 Phorbol LPS     U01064 Dictyostelium Y kinase
173627    Bone Marrow           MMU14166 Kiz
181971    Placenta              HUMTKR Y kinase receptor
182538    Placenta              HSNEK2R kinase
184416*   Cardiac Muscle        KPKS Human proto-oncogene Ser/Thr kinase
191283    Rheumatoid Synovium   RATSGPK Ser/Thr kinase
192268    Rheumatoid Synovium   ATHAPK1A Ser/Thr kinase
214915    Stomach               XLMPK2K Map kinase
223163    Pancreas              TGF-β receptor ser/thr kinase
237002    Small Intestine       P16227 Mouse Y kinase blk
239990    Hippocampus           SHC Human transforming protein
240142    Hippocampus           HSNEK2R
275781    Testes                BOVCKIA casein kinase
285465    Eosinophils           DDIMLCK myosin light chain kinase



SEQUENCE LISTING 



( 1 ) GENERAL INFORMATION: 

( i i i ) NUMBER OF SEQUENCES: 45

( 2 ) INFORMATION FOR SEQ ID NO:1:



( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 526 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: U937 
( B ) CLONE: 297 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

ACAAGGGTTG TAATTAAAGG CGATTTTGAA ACAATTAAAA TCTGTGATGT AGGAGTCTCT 60 

CTACCACTGG ATGAAAATAT GACTGTGACT GACCCTGAGG CTTGTTACAT TGGCACAGAG 120

CCATGGAAAC CCAAAGAAGC TGTGGAGGAG AATGGTGTTA TTACTGCAAG GCAGACATAT 180

TTGCCTTTGG CTTACTTTGT GGGAAATGAT GACTTTATCG ATTCCACACA TTAATCTTTC 240 

AAATGATGAT GATGATGAAG TAAAAACTTT TTGATGAAAA GTAATTTTGA TGTTGAAGCA 300 

TTACTATGCA AGCCCTTTGG ACCTAAGGCC ACCCTATTTT AATATTGGAG GACCTTGGTG 360 

AATCATACCC AGGAAGGTAA TTTGACCTCT TCTCTGATCA CCCTTATTGA AGCCCCCAAG 420 

CACCCTTCTT GTGACAATTT TAGGTTGGAC CAGTTGCTTT GGGCCAACTT AACTAAAGTT 480 

GTTCGAAAAA CTTTTTTCCA AAAATTTCCA TAGGCCTCCC AAGTTT 526 

( 2 ) INFORMATION FOR SEQ ID NO:2: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 378 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: U937 
( B ) CLONE: 1622 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

AGAACACCAC ATCCGAGTGG CTGACTTTGG CAGTGCCACA TTTGACCATG AGCACCACAC 60 

CACCATTGTG GCCACCCGTC ACTATCGCCG CCTGAGGTGA TCCTTGAGCT GGGCTGGGCA 120

CAGCCTGGTG ACGTCTGGGC ATTGGCTGCA TTCTCTTTGA GTACTACCGG GGCTTCACAC 180

TCTTCCAGAC CCACGAAAAC CGAGAGCACC TGGTGATGAT GGAGAAGATC CTAGGGCCCA 240 

TCCCATCACA CATGATCCAC CGTACCAGGA AGCAGAATAT TTCTACAAAG GGGGCCTAGT 300 

TTGGGATGGA CAGCTCTTAC GGCCGGTATG TAAGGGACTC AAACCTTTAA GGTTCATGTT 360 

CAAGCTTCCT GGGAAGTG 378 

( 2 ) INFORMATION FOR SEQ ID NO:3: 

( i ) SEQUENCE CHARACTERISTICS: 

( A ) LENGTH: 326 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: THP-1 Phorbol LPS
( B ) CLONE: 10007 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:

GGGCTGGCAG CCCGGTTGGA GCCTCCGGAG CAGAGGAAGA AGACCATCTT GGCACCCCCA 60

ACTATGTGGC TCCAGAAGTG CTGCTGAGAC AGGGCCACGG CCCTGAGGCG GATGTATGGT 120 

CACTGGGCTG TGTCATGTAC ACGCTGCTCT GCGGGACCCT CCCTTTGAGA CGGCTGACCT 180 

GAAGGAGACG TACCGCTGCA TCAAGAAGGT TCACTACAAC GGTGCCTGCC AGCTCTTAAT 240 

GCCTGCCCGA GTCCTTGGCC GCAATCCTTC GGGCCTTAAC CCGAGAACCG GCCCTCTATT 300

GACAGATCCT TGCGGCAATT AACTTT 326 

( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 257 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: THP-1 Phorbol LPS
( B ) CLONE: 12702

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:

CCGCAAGACA CCTCCTGGAG GGCCTCCTGA GAAGGACAGG CAAAGGGCTG GGCCAAGGAT 60 

GACTTCATGG AGATTAAGAG TCATGTTTCT TCTCCTTAAT TAACTGGGAT GATCTCATTA 120

ATAAGAAGAT TACTCCCCCT TTTACCCAAA TGTGAGTGGG CCCAACGCCT ACGGACTTTG 180 

CCCCGAGTTT ACGAAGAGCC TTCCCCAATC CATTGGAAGT CCCCTGAAAG GTCCTATACA 240 

AGTCAGTTAA GGAAGTT 257 

( 2 ) INFORMATION FOR SEQ ID NO:5: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 252 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Inflamed Adenoid 
( B ) CLONE: 23789 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5: 

GTGAAGAATG TGGGGCTGAC CCTCGGAAGT CATCGGGAGC GTGGATGATC TCCTGCCTTC 60 

CTTGCCGTCA TCTCACGGAC AGAGATCGAG GGCACCCAGA AACTGCTCAA CAAAGACCTG 120 

GCAGAGCTCA TCAACAAGAT GCGCTGGCGC AAGAACGCGT GACCTCCCTG TAGGAGTAAG 180 

AGGCAGATCT GACGGTTCAC AACCCTGGCT GTGACGCAAG AACCTCTTAC GTGTGCCAGG 240

CCCAAAGTTC TG 252 

( 2 ) INFORMATION FOR SEQ ID NO:6: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 255 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC
( B ) CLONE: 35652

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CAAAATCGTG GCCCGGAGAA TGGCGGGGCC TCAACCCTCT CCTGGACCAG CGGCAGCTCA 60 

CTACTCAGCT TTTGGCCTGT GGGCGAGTGG CTTCGGGCCA TCAAAATGGG AAGATACGAA 120 

GAAAGTTTCG CAGCCGCTGG CTTTGCCTCC TTCAGCTGGT CAGCCAGATC TCTGCTGAGG 180 

ACCTGCTCCG AATCGAGTCA CTCTGGCGGG ACACCAGAAG AAAATTTGGC CAGTTCCAGC 240 

ACATGAGTCC CAGGT 255 

( 2 ) INFORMATION FOR SEQ ID NO:7: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 238 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC
( B ) CLONE: 35855 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7: 

GAATACCCCA TATACATAGT GACTGATATA TAAGCAATGG CTGCTTGCTG AATACCTGAG 60 

GAGTCACGGA AAAGGCTTAA CCTTCCCAGT CTTAGAAATG TGCTACGATG TCTGTAAGGC 120 

ATGGCCTTCT TGGAGAGTCA CCAATTCATA CACCGGGCTT GGCTGCTCGT AACTGCTTGG 180 

TGGACAGAGA TCTCTGTGTG AAAGTTCTCC ATTTGGATGA CAAGGTATGT TCTTGATG 238 

( 2 ) INFORMATION FOR SEQ ID NO:8: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 261 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T+B Lymphoblast
( B ) CLONE: 40194

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8:

AAACAACTTG ATTATTTAGG AATTCCTCTG TTTTATGGAT CTGGTCTGAC TGAATTCAAG 60 

GGAAGAAGTT ACAGATTTAT GGTAATGGAA AGACTAGGAA TAGATTTACA GAAGATCTCA 120 

GGCCAGAATG GTACCTTTAA AAAGTCAACT GTCCTGCAAT TAGGATCCGA ATGTTGGATG 180 

TACTGGAATA TATACATGAA AATGAATATG TTCATGGTGA TATAAAAGCA GCAAATCTAC 240

TTTTGGGTTA CAAAAATCCT T 261 

( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 242 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T+B Lymphoblast
( B ) CLONE: 42170

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

TAACAAACCT GAAGATCGAG CCACTGCTGA AGAATGTCTA AAGCACCCCT GGTTGACACA 60

GAGCAGTATT CAAGAGCCTT CTTTCAGGAT GGAAAAGGCA CTAGAAGAAG CAAATGCCCT 120 

CCAAGAAGGT CATTCTGTGC CTGAAATTAA TTCGGATACC GACAAATCAG AAACCGAGGA 180 

ATCCATTGTA ACCGAAGAGT TAATTGTAGT TACTTCATAT ACTCTAGGGC AATGCAGACA 240 

GT 242 

( 2 ) INFORMATION FOR SEQ ID NO:10:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 222 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Corneal Stroma 
( B ) CLONE: 46081 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:

GCAAAGGACA GTCCGCCGAG GTGCTCGGTG GAGTCATGGC ATTCCCTTTT GGAAGACTGG 60

CCTTGGTGCA AACCCTGGAG AAGGTGCCTA TGGAGAAGTT CAACTTGCTG TAAATAGAGT 120 

AACTAAGAAG CAGTCGCAGT GAAGATTTAG ATATAAGCGT GCCGTAGACT GTCCCGAAAA 180 

TATTAAGTAG ATCTGTATCA ATAAAATGCT AATCATGAAA TT 222 

( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 225 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Corneal Stroma 
( B ) CLONE: 46651 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:

ATGCTCCGCC AGTGAGAAGG GCGGCTGCCT GAGCGCCTCA CCAGTCCTCA TCACCCAGAT 60 

CCTGTGGCTT TGAGACACCT TCACTTAAGA ACATTTGCCA CTTGACTTAA ACCAGAAACG 120 

TGTTTTGTGG CATCAGCAGA CCCTTTCTCA GGTAAGTTGT GCTTTGCTTT TAGCATACGT 180

GAGAAGTTGT TCCGCTCCAT TTTGTGGGAC GTCTTTCTTT CCTTG 225 

( 2 ) INFORMATION FOR SEQ ID NO:12:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 256 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 53840 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:12:

CAGCGCCTTA CATCTCGCAG CCAAGAACAG CCACCATGAA TGCATCAGGA AGCTGCTTCA 60

TCTAAATGCC CAGCCGAAAG TTTTGACAGC TCTGGGAAAA CAGCTTTACA TTATGCAGCG 120

GCTCAGCGCT GCCTTCAAGC TGTGCAGATT CTTGCGAACA CAAGAGCCCC ATAAACCTCA 180

AAGATTTGGA TGGGAATATA CCGCTGCTGC TTGCTGTACA AAATGGTCAC AGTGAGATCT 240

GTCACTTTTC CTGGTC 256



( 2 ) INFORMATION FOR SEQ ID NO:13:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 240 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 54065 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:13: 

GTTGACATCT GGTCCCTGGG CATATGGCCA TCGAAATGAT T<i AAGGGGAG CCTCATACCT 60

CAATGAAAAC CCTTGAGAGC CTTGTACCTC ATTGCCACCA ATGGGACCCC AGAACTTCAG 120

AACCCAGAGA AGCTGTCAGC TATCTTCCGG GACTTTCTGA ACCGCTGTCT CGAGATGGAT 180

GTGGAGAAGA GAGGTTCAGC TAAAGAGCTG CTACAGCATC AATTCCTGAA GATTGCCAAT 240

( 2 ) INFORMATION FOR SEQ ID NO:14:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 195 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 56494 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:14:

AACAGTGAAG AGCTCCGAGA AATTATGGGT ACCCTGATAT GTGGCTCCTG AAATTTAGTT 60

ATGATCCTAT AAGCATGGCA ACAGATATTG GAGCATTGGA GTGTTAACAT ATGTCATGCT 120

TACAGGAATA TCACCTTTTT AGGCAATGAT AAACAAGAAA CATTCTTAAA CATCTCACAG 180

ATGATTTTAA GTTAT 195

( 2 ) INFORMATION FOR SEQ ID NO:15:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 207 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 58029

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:15:

GGAGTGTTTA TCGAGCCAAA TGGATATCAC AGGACAAGGA GGTGGCTGTA AAGAAGCTCC 60

TCAAAATAGA GAAAGAGGCA GAAATACTCA GTGTCCTCAG TCACAGAAAC ATCATCCAGT 120

TTTATGGAGT AATTTTGAAC CTCCCAACTA TGGCATTGTC ACAGAATATG CTTCTTGGGT 180

CACTCTATGA TTACATTAAC AGTACAA 207

( 2 ) INFORMATION FOR SEQ ID NO:16:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 184 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta 
( B ) CLONE: 64663 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:16: 

CGGGGTGGTA AAACTTGGAG ATCTTGGGAT TGGCGGTTTT AGCTCAAAAA CCACAGCTGC 60

ACATTCTTTA GTTGGTACGC CTATTCATGT TCCAGAGGAT ACAGAAATGG ATACAACTTC 120 

AAATCTCATC TGGTCTCTTG GCTGTCTACT ATATGGATGG CTGCATTACA AAGTCCTTTC 180 

TATG 184

( 2 ) INFORMATION FOR SEQ ID NO:17: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 206 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC Sheer Stress 
( B ) CLONE: 67967

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:17:

TGAATTGCTG AGCATAGACC TTTATGAGCT GATTAAAAAA AATAAGTTTC AGGTTTTAGC 60

GTCCAGTTGG TACGCAAGTT TGCCCAGTCC ATCTTGCAAT CTTTGGTGCC CTCCACAAAA 120 

TAAGATTATT CACTGCGATC TGAGCCAGAA AACATTCTCC TGAAACACCA CGGGCGCAGT 180 

TCAACCAAGG TCATTGACTT TGGGTT 206

( 2 ) INFORMATION FOR SEQ ID NO:18:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 268 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC Sheer Stress 
( B ) CLONE: 68963 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:18:

GGGAAGTGGC CAGTTTGGAG TGGTCAGCTG GGCAAGTGGA AGGGGCAGTA TGATGTTGCT 60 

GTTAAGATGA TCAAGGAGGG CTCCATGTCA GAAGATGAAT TTTTCAGGAG GCCCAGACTA 120 

TATGAAACTC AGCCATCCCA AGCTGGTTAA ATTCTATGGA GTGTGTTAAA GGATTACCCC 180 

ATATACATGT GACTAATATA TAGCAATGCT TGCTTTTCTG AATTACCTGG GGACTCACGG 240

AAAAAGGACT TTTAACCCTT CCCGCTTG 268

( 2 ) INFORMATION FOR SEQ ID NO:19:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 224 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Placenta 
( B ) CLONE: 71904

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:19:

CCTGGGGTGG TAAAACTTGG AGACTTGGCT TGGCCGGTTT TCCACCTCAA AAACCACAGC 60 

TGCACATCCT TTAGTTGGTA CGCCTTATTA CATGTTCCAG AGAGATACAT GAAAATGGAT 120 

ACAACTCAAA CTGACATCTG GCCTTTGGCT GTTACTATAT GAATGGCTGC TTACAAAGCC 180 

TTCCTATGGT GACAAAATGA TTTTACTCAT TGTGTAAGAG ATAG 224

( 2 ) INFORMATION FOR SEQ ID NO:20: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 195 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: THP-1 Phorbol
( B ) CLONE: 75289 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:20: 

GCGGGGAATG ACTCCCTATC CTGGGGTCCA GAACCATGAG ATGTATGATA TCTTCTCCAT 60

GGCCACAGGT TGAAGCAGCC CGAAGACTGC CTGGTGAACT GTATGAAATA ATGTACTCTT 120 

GCTGGAGAAC CGATCCCTTA GACCGCCCCA CCTTTTCATA TTGAGGCTGC AGCTAGAAAA 180 

ACTCTTAGAA AGTTT 195 

( 2 ) INFORMATION FOR SEQ ID NO:21: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 219 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Rheumatoid Synovium 
( B ) CLONE: 81865

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:21:

CACACGAGAA GCAGAAACAC GACGGGCGGG TAAGATCGGC CACTACATTC TGGTGACACG 60

CTGGGGGTCG GCACCTTCGG CAAAGTGAAG GTTGGCAAAC ATGATTGACT GGCATAAAGT 120 

AGCTGTAAGA TACTCATCGA CAGAAGATTC GGAGCCTTGA TGTGGTAGGA AAAATCCCAG 180 

GAAATTCAGA ACCTCAAGCT TTTCAGGCAT CCTCATATA 219 

( 2 ) INFORMATION FOR SEQ ID NO:22:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 181 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC Sheer Stress
( B ) CLONE: 82056

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:22:

CCACCAAAGA TCTCAAATAA AGTTGATGTG TGGTCGGTGG GTGTATCTCT ATCAGTGTCT 60 

TTATGGAAGG AAGCCTTTTG GCCATAACCA GTCTCAGCAA GACATCCTAC AAGAGAATAC 120 

GATTTTAAAG CTACTGAAGT GCAGTTCCCG CCAAAGCCAG TAGTAACACC TGAAGCAAAG 180

G 181

( 2 ) INFORMATION FOR SEQ ID NO:23: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 218 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: AML Blast 
( B ) CLONE: 108485 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:23: 

TATGGTTATA TGGAAGAGAA TGTGACTGGT GGTCGGTTGG GGTATTTTTA TACGAAATGC 60

TTGTAGGTGA TACACCTTTT TATGCAGATT CTTTGGTTGG AACTTACAGT AAAATTATGA 120

ACCATAAAAA TTCACTTACC TTTCCTGATG ATAATGACAT ATCAAAAGAA GCAAAAAACC 180 

TTATTTGTGC CTTCCTTACT GACAGGGAAG TGAGGTTA 218 

( 2 ) INFORMATION FOR SEQ ID NO:24: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 264 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Testis
( B ) CLONE: 114973

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:24:

GACGGTGGCC ATTTGACATG TGGAGCCTGG GTGCATCACG GTGGAGTTGT ACACGGGCTA 60

CCCCCTGTTC CCCGGGAGAA TGAGGTGGAG CAGCTGGCCT GCATCATGGA GGTGCTGGGT 120 

CTGCCGCCAG CCGGCTTCAT TCAGACAGCC TCCAGGAGAC AGACATTCTT TGATTCCAAA 180 

GGTTTTCCTA AAAATATAAC CACAACCAGG GGAAAAAAAG ATTCCAGATT CCAAGGGCCC 240 

TCACGGATTG GTGCTGAAAA AACT 264 

( 2 ) INFORMATION FOR SEQ ID NO:25:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 236 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 118591

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:25:

GACTGAGGAC ACTGAAACAT CATCCAGTTT TATGGAGTAA TTCTTGAACC TCCCAACTAT 60

GGCATTGTCA CAGAATATGC TTCTCTGGGA TCACTCTATG ATTACATTAA CAGTAACAGA 120 

AGTGAGGAGA TGGATATGGT CACATTATGA CCTGGGCCAC TGATGTAGCC AAAGGAATGC 180 

ATTATTTACA TATGGGGCTC CTGTCAAGGT GATTCACAGA GACCTCAAGT CAAGGA 236 

( 2 ) INFORMATION FOR SEQ ID NO:26: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 200 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 119819

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:26:

CCTGCATGGC CTTCGAGCTG GCCACTGGTG ACTACCTGTT CGAGCCGCAT TCTGGAGAAG 60

ACTACAGTCG TGATGAGGGT AAGGGGTGAG GGCTCTGGGC TCAGCCTCCC GGCCTCCCGG 120 

CCTGCCTGCC CCCAACCTCC TCTTTTGCCC ACAGACCACA TCGCTCACAT AGTGGAGCTT 180 

CTGGGGGACA TCCCCCCAGC 200 

( 2 ) INFORMATION FOR SEQ ID NO:27: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 217 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 120376

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:27:

GATTACAAGT AGCTTGGTTG TAGTGGAAAA AAACGAGAGA TTAACCATTC CAAGCAGTTG 60 

CCCCAGAAGT TTTGCTGAAC TTTACATCAG TTTGGGAAGC TGATGCCAAG AAACGGCCAT 120 

CATTCAAGCA AATCATTTCA ATCCTGGGTC CATGTCAAAT GACACGAGCC TTCCTGCAAG 180 

TGTAACTCAT TCCTACACAA CAAGGCGGAG TGGAGGT 217

( 2 ) INFORMATION FOR SEQ ID NO:28: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 156 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Bone Marrow 
( B ) CLONE: 132750

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:28:

GTAGATTTGA CTCTGTTGTT TTCTCTCGTA GTTCCCAAAC TCATGGAAGT CTGTTTTTAT 60

CAATATGATG TAAAGTCTGA AATATACAGC TTTGGAATCG TCCTCTGGGA AATCGCCACT 120

GGAGATATCC CGTTTCAAGG CTGTAATTCT GAGAAG 156 
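
Each row of a listing entry ends with a running base count, which makes transcription errors easy to detect. The following sketch checks the rows of SEQ ID NO:28 (with spacing inside the ten-base groups normalised) against the printed totals and the declared LENGTH of 156. The function name `check_listing` and the message wording are hypothetical, and the check is an illustration rather than part of the original listing.

```python
import re
from typing import List

# A row is groups of bases followed by the running total printed at the right.
ROW_PATTERN = re.compile(r"\s*([ACGTN ]+?)\s+(\d+)\s*")


def check_listing(rows: List[str], declared_length: int) -> List[str]:
    """Return a list of discrepancies; an empty list means the entry is consistent."""
    problems = []
    running = 0
    for number, row in enumerate(rows, start=1):
        match = ROW_PATTERN.fullmatch(row.upper())
        if match is None:
            problems.append(f"row {number}: unrecognised characters")
            continue
        running += len(match.group(1).replace(" ", ""))
        printed = int(match.group(2))
        if running != printed:
            problems.append(f"row {number}: counted {running}, printed {printed}")
            running = printed  # resynchronise so later rows are judged on their own
    if running != declared_length:
        problems.append(f"total {running} differs from declared {declared_length}")
    return problems


SEQ_28_ROWS = [
    "GTAGATTTGA CTCTGTTGTT TTCTCTCGTA GTTCCCAAAC TCATGGAAGT CTGTTTTTAT 60",
    "CAATATGATG TAAAGTCTGA AATATACAGC TTTGGAATCG TCCTCTGGGA AATCGCCACT 120",
    "GGAGATATCC CGTTTCAAGG CTGTAATTCT GAGAAG 156",
]
print(check_listing(SEQ_28_ROWS, 156))
```

Because spaces inside a row are ignored, the check tolerates groups that were split during scanning, but it flags rows containing stray characters or a damaged running total.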



( 2 ) INFORMATION FOR SEQ ID NO:29:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 224 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T Lymphocyte 
( B ) CLONE: 140052 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:29:

TGTAAATAAG GCCCTTCTCC ACTTGACTTC AGGCAGCAGA TTGTCTAGAA GCCTAAGGAC 60

AGCAATTTCT CTGACAAGAC AAAGTAGATA TTTTATACCA GGGGTTGGCA AACTACTGCC 120 

CACGGGCCGA ATTTGGCCCA GTCTGTTTTT GTATGGTGCA AACTAAAAAT GATTTTTACA 180 

TTTTTAAAGA GTTATAAAAG AAAAAAATAT GTGGTCTGTG AAAT 224



( 2 ) INFORMATION FOR SEQ ID NO:30: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 198 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T Lymphocyte 
( B ) CLONE: 146392 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:30:

TTTTCTTTGT GTTTTTTTTT GTTCCAGTTT ATTTTAAATG CATATTTTAG TTGATTGCTT 60

TTTTAAAAAG CCCCCTCTGG CCTCCTGATT CCAGCTAGTG TCAGCAGTGG GATACCTGCG 120 

CTTGAAGGAC ATCATCCACC GTGACATCAA GGATGAGAAC ATCGTGATCG CCGAGGACTT 180 

CACAATCAAG CTGATAGT 198 



( 2 ) INFORMATION FOR SEQ ID NO:31: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 210 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: THP-1 Phorbol LPS 
( B ) CLONE: 156108 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:31:

TGAAAACTAT GAACCTGGAC AAAAATCAAG GGCCAGTATC AAGCACGATA TATATAGCTA 60 

TGCAGTTATC ACATGGGAAG TGTTATCCAG AAAACAGCCT TTTGAAGATG TCACCAATCC 120

TTTGCAGATA ATGTATAGTG TGTCACAAGG ACATCGACCT GTTATTAATG AAGAAACTTT 180

GCCATATGAT ATACCTCACC GACCACGTAT 210

( 2 ) INFORMATION FOR SEQ ID NO:32:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 202 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Bone Marrow 
( B ) CLONE: 173627 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:32:

AGAAGATCGG GGCCGGCTTC TTCTCTGAGG TCTACAAGGT TCGGCACCGA CAGTCAGGGC 60

AAGTATGGTG CTGAAGATGA ACAAGCTCCC CAGTAACCGG GGCAACACAC TACGGGAAGT 120 

GCAGCTGATG AACCGGCTCA GGCACCCCAA CATCCTAAGG TTCATGGGAG TCTGTGTGCA 180 
CCAGGGACAG CTGCACGCTC TT 202 

( 2 ) INFORMATION FOR SEQ ID NO:33:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 222 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta
( B ) CLONE: 181971

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:33:

CGTTTTTGGA GGGTTCACAC CTGTCCCTTT CAAATGCTGG CGCTTTCACA CACTCCTTCT 60

CTCCTGCCAG CACCTTCTGG TCTCAGGAGC ATTGCAGGAT GTTGTGTGAG TAAGTATGGG 120 

AGACACTTTA GTATGGCTTT TTTCAGCTTA GCCTCCTGTT ATCAGAGAGC AGTCTCTTTC 180 

AGTGTCAAGG TTTGAGTACT AGATGGTGGA GAAAGCCTGT TT 222 

( 2 ) INFORMATION FOR SEQ ID NO:34: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 192 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Placenta 
( B ) CLONE: 182538 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:34:

CTTGGGGTGG TAAAACTTGG AGATCTTGGG CTTGGCCGGT TTTTCAGCTC AAAAACCACA 60 

GCTGCACATT CTTTAGTTGG TACGCCTTAT TACATGTCTC CAGAGAGAAT ACATGAAAAT 120 

GGATACAACT TCAAATCTGA CATCTGGTCT CTTGGCTGTC TACTATATGA GATGGCTGCA 180 

TTACAAAGTC CT 192

( 2 ) INFORMATION FOR SEQ ID NO:35:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 152 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Cardiac Muscle
( B ) CLONE: 184416

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:35:

CTATGGAAGG CCGCTGGCAG GGCAATGACA TTGTCGTGAA GGTGCTGAAG GTTCGAGACT 60 

GGAGTACAAG GAAGAGCAGG GACTTCAATG AAGAGTGTCC CCGGCTCAGG ATTTTTCGCA 120 

TCCAAATGTG CTCCCAGTGC TAGGTGCCTG CC 152 

( 2 ) INFORMATION FOR SEQ ID NO:36:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 152 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Rheumatoid Synovium 
( B ) CLONE: 191283 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:36: 

CAACTACAGT GAACCTAAAA TGCCTCTAAT ACCTTTGCAA TTATCTTTAA GAGGATATCT 60 

TATGAGTGAA ATTAACTTGT GCAACTACTT TCCTATTCAC TTTTTTACAG AGACTTAAAA 120 

CCAGACAATA TTTCTAGATT CACAGGGACA CT 152 

( 2 ) INFORMATION FOR SEQ ID NO:37: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 199 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Rheumatoid Synovium
( B ) CLONE: 192268

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:37:

AGTGGACTGC AGTAAGCAGA GCTTCCTGAC CGAGOTGGAG CAGCTGTCCA GGTTTCGTCA 60 

CCCAAACATT GTGGACTTTC TGGCTACTGT GCTCAGAACG GCTTCTACTG CCTGGTGTAC 120 

GGCTTCCTGC CCAACGGCTC CCTGGAGGAC CGTTCCACTG CCAGACCCAG GCCTGCCCAC 180 

CTCTCTCCTG GCCTCAGCG 199 

( 2 ) INFORMATION FOR SEQ ID NO:38:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 189 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 



( i i ) MOLECULE TYPE: cDNA 
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( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Stomach 
( B ) CLONE: 214915 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO-J8: 



AGAAGATCCA GTACCTGGTG TATCAATGCT CAAAGGCCTT AAGTACATCC ACTCTCTGGG 60 

GTCGTGCACA GGGACCTGAA GCCAGGCAAC CTGGCTGTGA ATAGGACTGT AACTGAAGAT 120 

TCTGGATTTT GGGCTGGCGC GACATGCAGA CGCCGAGATG ACTGGCTACG TGGTGACCCG 180 

CTGGTACCT 189 



( 2 ) INFORMATION FOR SEQ ID NO:39: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 167 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Pancreas 
( B ) CLONE: 223163 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:39: 

CTTGCTCTTC TGACAGGATG AGAGTTATTA TAAGCAAATC CTACCTAGAG GCTTTTAACT 60 

CTAATGGGAA TAACTTGCAA CTAAAAGACC CAACTTGCAG ACCAAAATTA TCAAATGTTG 120 

TGGATTTTCT GTCCCTCTTA ATGGATGTGG TACAATCAGA AAGGTAG 167 



( 2 ) INFORMATION FOR SEQ ID NO:40: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 197 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Small Intestine 
( B ) CLONE: 237002 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:40: 



CCCAAACCTG CCCAGCCAGC CCTGAAAATG CAAGTTTTGT ACGATTTTGA AGCTAGGAAC 60 

CCACGGGAAC TGACTGTGGT CCAGGGAGAG AAGCTGGAGG TTTGGACCAC AGCAAGCGGT 120 

GGTGGCTGGT GAAGAATAGG CGGGACGGAG CGGCTACATT CCAAGCAACA TCTGGGCCCC 180 

TACAGCCGGG GACCCCG 197 



( 2 ) INFORMATION FOR SEQ ID NO:41: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 207 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Hippocampus 
( B ) CLONE: 239990 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:41: 

CCAAGATGCT GGAGGAACTC AAGCCGAGAC TTGTACCAAG GAGAGATGAG CAGGAAGGAG 60 

CCAGAGGGCT CTGAGAAAGA CGGGACTTCC TGGTCAGGAA GAGCACCACC AACCCGGGCT 120 

CCTTTTCCTC ACGGGCATGC ACAATGGCCA GGCAAGCACC TGCTGCTCTT GGACCCAGAA 180 

GGCACGTCCG GACAAAGGCA GAGTCTT 207 

( 2 ) INFORMATION FOR SEQ ID NO:42: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 195 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Hippocampus 
( B ) CLONE: 240142 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:42: 

GTCACCGGAG AGGATCCATG AGAACGGCTA CAACTTCAAG TCCGACATCT GGTCCTTGGG 60 

CTGTCTGCTG TACGAGATGG CAGCCCTCCA GAGCCCCTTC TATGGAGATA AGATGAATCT 120 

TTCTCCCTGT GCCAGAAGAT CGAGCAGTGT GACTACCCCC CACTCCCCGG GGAGCACTAC 180 

TCCGAGAAGT TACGT 195 

( 2 ) INFORMATION FOR SEQ ID NO:43: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 213 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Testes 
( B ) CLONE: 275781 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:43: 

CTCGTCTATT CGGCACGAGT TTCATTGTCG AAGGAAATAT AAACTGTCTG GAAGATCTGG 60 

TGTAGCTCCT TCGAGACATC TTTGGCGATC AGCATCACCA ACGGTAAGAA GTGTAGTAAG 120 

CCAGATCTCA GGGCCAGGCA TCCCCAGTTG CTGTACAAGA GCAGGCTTTC AAGATGCTTC 180 

AAGGTCCCTG TCCATCAATA TGCTACACAT TTG 213 

( 2 ) INFORMATION FOR SEQ ID NO:44: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 425 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Eosinophils 
( B ) CLONE: 285465 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:44: 

AAATACTTGA AGCAGTTTAT TATCTACATC AGAATAACAT TGTACACCTT GATTTAAAGC 60 

CACAGAATAT ATTACTGAGC AGCATATACC CTCTCGGGGA CATTAAAATA GTAGATTTTG 120 

GAATGTCTCG AAAAATAGGG CATGCGTGTG AACTTCGGGA AATCATGGGA ACACCAGAAT 180 
ATTTAGCTCC AGAAATCCTG AACTATGATC CCATTACCAC AGCAACAGAT ATGTGGAATA 240 

TTGGTATAAT AGCATATATG TTGTTAACTC ACACATCACC ATTTGTGGGA GAAGATAATC 300 

AAGAAACATA CCTCAATATC TCTCAAGTTA ATGTAGATTA TTCGGAAGGA ACTTTTTCAT 360 

CAGTTTCACA GCTGGCACAG ACTTTATTCA GAGCTTTTAG TAAAATCAGA OGAAAGGCCC 420 

ACAGC 425 



( 2 ) INFORMATION FOR SEQ ID NO:45: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 1851 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Stomach 
( B ) CLONE: 214915E 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:45: 

GCCCGTTGGG CCGCGAACGC AGCCGCCACG CCGGGGCCGC CGAG A f T CGGG TGCCCGGGAT 60 

GAGCCTCATC CGGAAAAAGG GCTTCTACAA GCAGGACGTC AACAAGACCG CCTGGGAGCT 120 

GCCCAAGACC TACGTGTCCC CGACGCACGT CGGCAGCGGG GCCTATGGCT CCGTGTGCTC 180 

GGCCATCGAC AAGCGGTCAG GGGAGAAGGT GGCCATCAAG AAGCTGAGCC GACCCTTTCA 240 

GTCCGAGATC TTCGCCAAGC GCCCC1 ACCG GGAGCTGCTG TTGCTGAAGC ACATGCAGCA 300 

TGAGAACGTC ATTGGGCTCC TGGATGTCTT CACCCCAGCC TCCTCCCTGG AACTTCTATG 360 

ACTTCTACCT GGTGATGCCC TTCATGCAGA CGGATCTGCA GAAGATCATG GGGATGGAGT 420 

TCAGTGAGGA GAAGATCCAG TACCTGGTGT ATCAGATGCT CAAAGGCCTT AAGTACATCC 480 

ACTCTGCTGG GGTCGTGCAC AGGGACCTGA AGCCAGGCAA CCTGGCTGTG AATGAGGACT 540 

GTGAACTGAA GATTCTGGAT TTGGGGCTGG CGCGACATGC AGACGCCGAG ATGACTGGCT 600 

ACGTGGTGAC CCGCTGGTAC CGAGCCCCCG AGGTGATCCT CAGCTGGATG CACTACAACC 660 

AGACAGTGGA CATCTGGTCT GTGGGCTGTA TCATGGCAGA GATGCTGACA GGGAAAACTC 720 

TGTTCAAGGG GAAAGATTAC CTGGACCAGC TGACCCAGAT CCTGAAAGTG ACCGGGGTGC 780 

CTGGCACGGA GTTTGTGCAG AAGCTGAACG ACAAAGCGGC CAAATCCTAC ATCCAGTCCC 840 

TGCCACAGAC CCCCAGGAAG GATTTCACTC AGCTGTTCCC ACGGGCCAGC CCCCAGCCTG 900 

CGGACCTGCT GGAGAAGATG CTGGAGCTAG ACGTGGACAA GCGCCTGACG GCCGCGCAGG 960 

CCCTCACCCA TCCCTTCTTT GAACCCTTCC GGGACCCTGA GGAAGAGACG GAGGCCCAGC 1020 

AGCCGTTTGA TGATTCCTTA GAACACGAGA AACTCACAGT GGATGAATGG AAGCAGCACA 1080 

TCTACAAGGA GATTGTGAAC TTCAGCCCCA TTGCCCGGAA GGACTCACGG CGCCGGAGTG 1140 

GCATGAAGCT GTAGGGACTC ATCTTGCATG GCACCGCCGG CCAGACACTG CCCAAGGACC 1200 

AGTATTTGTC ACTACCAAAC TCAGCCCTTC TTGGAATACA GCCTTTCAAG CAGAGGACAG 1260 

AAGGGTCCTT CTCCTTATGT GGGAAATGGG CCTAGTAGAT GCAGAATTCA AAGATGTCGG 1320 

TTGGGAGAAA CTAGCTCTGA TCCTAACAGG CCACGTTAAA CTGCCCATCT GGAGAATCGC 1380 

CTGCAGGTGG GGCCCTTTCC TTCCCGCCAG AGTGGGGCTG AGTGGGCGCT GAGCCAGGCC 1440 

GGGGGCCTAT GGCAGTGATC CTGTGTTGGT TTCCTAGGGA TGCTCTAACG AATTACCACA 1500 

AACCTGGTGG ATTGAAACAG CAGAACTTGA TTCCCTTACA GTTCTGGAGG CTGGAAATCT 1560 
GGGATGGAGG TGTTGGCAGG GCTGTCGTCC CTTTGAAGGC TCTGGGGAAG AATCCTTCCT 1620 

TGGCTCTTTT TAGCTTGTGG CGGCAGTGGG CAGTCCGTGG CATTCCCCAG CTTATTGCTG 1680 

CATCACTCCA GTCTCTGTCT CTTCTGTTCT CTCCTCTTTT AACAACAGTC ATTGGATTTA 1740 

GGGCCCACCC TAATCCTGTG TGATCTTATC TTGATCCTTA TTAATTAAAC CTGCAAATAC 1800 

TCTAGTTCCA AATAAAGTCA CATTCTCAGG TAAAAAAAAA AAAAAAAAAA A 1851 
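
The listing format used above (ten-base groups followed by a running residue count) lends itself to a simple consistency check. The following is a minimal Python sketch and is not part of the original filing; the function name check_listing and the accepted alphabet (A, C, G, T, N) are assumptions made for illustration. It confirms that each running count matches the residues read so far and that the total agrees with the declared LENGTH, using the SEQ ID NO:35 entry from this listing as input.

```python
import re

# One listing row: space-separated groups of bases, then the running residue count.
LINE = re.compile(r"^([A-Za-z ]+?)\s+(\d+)$")


def check_listing(lines, declared_length, alphabet="ACGTN"):
    """Return a list of problems found in one sequence-listing block.

    An empty list means every running count and the declared length agree.
    """
    problems = []
    total = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank spacer lines between rows
        match = LINE.match(line)
        if match is None:
            problems.append(f"unparseable line: {line!r}")
            continue
        bases = match.group(1).replace(" ", "").upper()
        total += len(bases)
        unexpected = sorted(set(bases) - set(alphabet))
        if unexpected:
            problems.append(f"unexpected symbols {unexpected} in: {line!r}")
        if int(match.group(2)) != total:
            problems.append(
                f"running count {match.group(2)} but {total} residues read"
            )
    if total != declared_length:
        problems.append(f"declared {declared_length} residues, found {total}")
    return problems


# SEQ ID NO:35 as given in this listing (declared LENGTH: 152 base pairs).
SEQ_ID_NO_35 = [
    "CTATGGAAGG CCGCTGGCAG GGCAATGACA TTGTCGTGAA GGTGCTGAAG GTTCGAGACT 60",
    "",
    "GGAGTACAAG GAAGAGCAGG GACTTCAATG AAGAGTGTCC CCGGCTCAGG ATTTTTCGCA 120",
    "",
    "TCCAAATGTG CTCCCAGTGC TAGGTGCCTG CC 152",
]

print(check_listing(SEQ_ID_NO_35, 152))  # prints []
```

A row whose count has been damaged in transcription (for example a count fused with a stray letter) fails to parse and is reported rather than silently accepted.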



We claim: 

1. A purified polynucleotide having a nucleic acid 
sequence selected from the group consisting of SEQ ID 
NO:1, SEQ ID NO:2, SEQ ID NO:3, SEQ ID NO:4, SEQ 
ID NO:5, SEQ ID NO:6, SEQ ID NO:7, SEQ ID NO:8, SEQ 
ID NO:9, SEQ ID NO:10, SEQ ID NO:11, SEQ ID NO:12, 
SEQ ID NO:13, SEQ ID NO:14, SEQ ID NO:15, SEQ ID 
NO:16, SEQ ID NO:17, SEQ ID NO:18, SEQ ID NO:19, 
SEQ ID NO:20, SEQ ID NO:21, SEQ ID NO:22, SEQ ID 
NO:23, SEQ ID NO:24, SEQ ID NO:25, SEQ ID NO:26, 
SEQ ID NO:27, SEQ ID NO:28, SEQ ID NO:29, SEQ ID 
NO:30, SEQ ID NO:31, SEQ ID NO:32, SEQ ID NO:33, 
SEQ ID NO:34, SEQ ID NO:35, SEQ ID NO:36, SEQ ID 
NO:37, SEQ ID NO:38, SEQ ID NO:39, SEQ ID NO:40, 
SEQ ID NO:41, SEQ ID NO:42, SEQ ID NO:43, and SEQ 
ID NO:44. 

2. An expression vector comprising the polynucleotide of 
claim 1. 

3. A host cell transformed with the expression vector of 
claim 2. 

4. A method for producing and purifying a polypeptide, 
said method comprising the steps of: 

a) culturing the host cell of claim 3 under conditions 
suitable for the expression of the peptide; and 

b) recovering the polypeptide from the host cell culture. 

* * * * * 
SECRETED PROTEINS AND 
POLYNUCLEOTIDES ENCODING THEM 

FIELD OF THE INVENTION 

The present invention provides novel polynucleotides and 
proteins encoded by such polynucleotides, along with 
therapeutic, diagnostic and research utilities for these 
polynucleotides and proteins. 



BACKGROUND OF THE INVENTION 






Technology aimed at the discovery of protein factors 
(including e.g., cytokines, such as lymphokines, interferons, 
CSFs and interleukins) has matured rapidly over the past 
decade. The now routine hybridization cloning and expression 
cloning techniques clone novel polynucleotides 
"directly" in the sense that they rely on information directly 
related to the discovered protein (i.e., partial DNA/amino 
acid sequence of the protein in the case of hybridization 
cloning; activity of the protein in the case of expression 
cloning). More recent "indirect" cloning techniques such as 
signal sequence cloning, which isolates DNA sequences 
based on the presence of a now well-recognized secretory 
leader sequence motif, as well as various PCR-based or low 
stringency hybridization cloning techniques, have advanced 
the state of the art by making available large numbers of 
DNA/amino acid sequences for proteins that are known to 
have biological activity by virtue of their secreted nature in 
the case of leader sequence cloning, or by virtue of the cell 
or tissue source in the case of PCR-based techniques. It is to 
these proteins and the polynucleotides encoding them that 
the present invention is directed. 

SUMMARY OF THE INVENTION 

In one embodiment, the present invention provides a 
composition comprising an isolated polynucleotide selected 
from the group consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1; 

(b) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1 from nucleotide 247 to nucleotide 
432; 

(c) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1 from nucleotide 328 to nucleotide 
432; 

(d) a polynucleotide comprising the nucleotide sequence 
of the full length protein coding sequence of clone 
BD372_5 deposited under accession number ATCC 
98146; 

(e) a polynucleotide encoding the full length protein 
encoded by the cDNA insert of clone BD372_5 deposited 
under accession number ATCC 98146; 

(f) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
BD372_5 deposited under accession number ATCC 
98146; 

(g) a polynucleotide encoding the mature protein encoded 
by the cDNA insert of clone BD372_5 deposited under 
accession number ATCC 98146; 

(h) a polynucleotide encoding a protein comprising the 
amino acid sequence of SEQ ID NO:2; 

(i) a polynucleotide encoding a protein comprising a 
fragment of the amino acid sequence of SEQ ID NO:2 
having biological activity; 

(j) a polynucleotide which is an allelic variant of a 
polynucleotide of (a)-(g) above; 

(k) a polynucleotide which encodes a species homologue 
of the protein of (h) or (i) above. 

Preferably, such polynucleotide comprises the nucleotide 
sequence of SEQ ID NO:1 from nucleotide 247 to nucleotide 
432; the nucleotide sequence of SEQ ID NO:1 from 
nucleotide 328 to nucleotide 432; the nucleotide sequence of 
the full length protein coding sequence of clone BD372_5 
deposited under accession number ATCC 98146; or the 
nucleotide sequence of the mature protein coding sequence 
of clone BD372_5 deposited under accession number 
ATCC 98146. In other preferred embodiments, the 
polynucleotide encodes the full length or mature protein encoded 
by the cDNA insert of clone BD372_5 deposited under 
accession number ATCC 98146. 
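
The nucleotide ranges quoted above can be checked arithmetically. The following is a minimal Python sketch, assuming the ranges are 1-based and inclusive (the usual convention for sequence listings) and read in a single frame from their first nucleotide; the helper names are hypothetical, and the text does not say whether the ranges include a stop codon.

```python
def span_length(first, last):
    """Length of a 1-based, inclusive nucleotide range."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return last - first + 1


def codon_count(first, last):
    """Number of whole codons in the range; the range must be a multiple of 3."""
    length = span_length(first, last)
    if length % 3:
        raise ValueError(f"{length} nucleotides is not a whole number of codons")
    return length // 3


def extract(sequence, first, last):
    """Slice a sequence string using 1-based, inclusive coordinates."""
    span_length(first, last)  # validates the coordinates
    if last > len(sequence):
        raise ValueError("range extends past the end of the sequence")
    return sequence[first - 1:last]


# Ranges quoted for SEQ ID NO:1.
print(span_length(247, 432), codon_count(247, 432))  # 186 62
print(span_length(328, 432), codon_count(328, 432))  # 105 35
print(codon_count(247, 327))                         # 27
```

Under these assumptions the two quoted start positions differ by 81 nucleotides, or 27 codons, which is consistent with the 27-residue predicted leader sequence that the description reports for this clone.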

Other embodiments provide the gene corresponding to the 
cDNA sequence of SEQ ID NO:1 or SEQ ID NO:3. 

In other embodiments, the present invention provides a 
composition comprising a protein, wherein said protein 
comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:2; 

(b) fragments of the amino acid sequence of SEQ ID 
NO:2; and 

(c) the amino acid sequence encoded by the cDNA insert 
of clone BD372_5 deposited under accession number ATCC 
98146; 

the protein being substantially free from other mammalian 
proteins. Preferably such protein comprises the 
amino acid sequence of SEQ ID NO:2. 

In one embodiment, the present invention provides a 
composition comprising an isolated polynucleotide selected 
from the group consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:4; 

(b) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:4 from nucleotide 316 to nucleotide 
501; 

(c) a polynucleotide comprising the nucleotide sequence 
of the full length protein coding sequence of clone 
BR533_4 deposited under accession number ATCC 
98146; 

(d) a polynucleotide encoding the full length protein 
encoded by the cDNA insert of clone BR533_4 deposited 
under accession number ATCC 98146; 

(e) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
BR533_4 deposited under accession number ATCC 
98146; 

(f) a polynucleotide encoding the mature protein encoded 
by the cDNA insert of clone BR533_4 deposited under 
accession number ATCC 98146; 

(g) a polynucleotide encoding a protein comprising the 
amino acid sequence of SEQ ID NO:5; 

(h) a polynucleotide encoding a protein comprising a 
fragment of the amino acid sequence of SEQ ID NO:5 
having biological activity; 

(i) a polynucleotide which is an allelic variant of a 
polynucleotide of (a)-(d) above; 

(j) a polynucleotide which encodes a species homologue 
of the protein of (g) or (h) above. 

Preferably, such polynucleotide comprises the nucleotide 
sequence of SEQ ID NO:4 from nucleotide 316 to nucleotide 
501; the nucleotide sequence of the full length protein 
coding sequence of clone BR533_4 deposited under accession 
number ATCC 98146; or the nucleotide sequence of the 
mature protein coding sequence of clone BR533_4 deposited 
under accession number ATCC 98146. In other preferred 
embodiments, the polynucleotide encodes the full 
length or mature protein encoded by the cDNA insert of 
clone BR533_4 deposited under accession number ATCC 
98146. 

Other embodiments provide the gene corresponding to the 
cDNA sequence of SEQ ID NO:4 or SEQ ID NO:6. 

In other embodiments, the present invention provides a 
composition comprising a protein, wherein said protein 
comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:5; 

(b) fragments of the amino acid sequence of SEQ ID 
NO:5; and 

(c) the amino acid sequence encoded by the cDNA insert 
of clone BR533_4 deposited under accession number ATCC 
98146; 

the protein being substantially free from other mammalian 
proteins. Preferably such protein comprises the 
amino acid sequence of SEQ ID NO:5. 

In one embodiment, the present invention provides a 
composition comprising an isolated polynucleotide selected 
from the group consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:7; 

(b) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:7 from nucleotide 113 to nucleotide 
433; 

(c) a polynucleotide comprising the nucleotide sequence 
of the full length protein coding sequence of clone 
CC288_9 deposited under accession number ATCC 
98146; 

(d) a polynucleotide encoding the full length protein 
encoded by the cDNA insert of clone CC288_9 deposited 
under accession number ATCC 98146; 

(e) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
CC288_9 deposited under accession number ATCC 
98146; 

(f) a polynucleotide encoding the mature protein encoded 
by the cDNA insert of clone CC288_9 deposited under 
accession number ATCC 98146; 

(g) a polynucleotide encoding a protein comprising the 
amino acid sequence of SEQ ID NO:8; 

(h) a polynucleotide encoding a protein comprising a 
fragment of the amino acid sequence of SEQ ID NO:8 
having biological activity; 

(i) a polynucleotide which is an allelic variant of a 
polynucleotide of (a)-(g) above; 

(j) a polynucleotide which encodes a species homologue 
of the protein of (g) or (h) above. 

Preferably, such polynucleotide comprises the nucleotide 
sequence of SEQ ID NO:7 from nucleotide 113 to nucleotide 
433; the nucleotide sequence of the full length protein 
coding sequence of clone CC288_9 deposited under accession 
number ATCC 98146; or the nucleotide sequence of the 
mature protein coding sequence of clone CC288_9 deposited 
under accession number ATCC 98146. In other preferred 
embodiments, the polynucleotide encodes the full 
length or mature protein encoded by the cDNA insert of 
clone CC288_9 deposited under accession number ATCC 
98146. In yet other preferred embodiments, the present 
invention provides a polynucleotide encoding a protein 
comprising the amino acid sequence of SEQ ID NO:8 from 
amino acid 1 to amino acid 77. 

Other embodiments provide the gene corresponding to the 
cDNA sequence of SEQ ID NO:7. 

In other embodiments, the present invention provides a 
composition comprising a protein, wherein said protein 
comprises an amino acid sequence selected from the group 
consisting of: 

(a) the amino acid sequence of SEQ ID NO:8; 

(b) the amino acid sequence of SEQ ID NO:8 from amino 
acid 1 to amino acid 77; 

(c) fragments of the amino acid sequence of SEQ ID 
NO:8; and 

(d) the amino acid sequence encoded by the cDNA insert 
of clone CC288_9 deposited under accession number ATCC 
98146; 

the protein being substantially free from other mammalian 
proteins. Preferably such protein comprises the 
amino acid sequence of SEQ ID NO:8 or the amino acid 
sequence of SEQ ID NO:8 from amino acid 1 to amino acid 
77. 

In certain preferred embodiments, the polynucleotide is 
operably linked to an expression control sequence. The 
invention also provides a host cell, including bacterial, 
yeast, insect and mammalian cells, transformed with such 
polynucleotide compositions. 

Processes are also provided for producing a protein, 
which comprise: 

(a) growing a culture of the host cell in a suitable culture 
medium; and 

(b) purifying the protein from the culture. 

The protein produced according to such methods is also 
provided by the present invention. Preferred embodiments 
include those in which the protein produced by such process 
is a mature form of the protein. 

Protein compositions of the present invention may further 
comprise a pharmaceutically acceptable carrier. Compositions 
comprising an antibody which specifically reacts with 
such protein are also provided by the present invention. 

Methods are also provided for preventing, treating or 
ameliorating a medical condition which comprises administering 
to a mammalian subject a therapeutically effective 
amount of a composition comprising a protein of the present 
invention and a pharmaceutically acceptable carrier. 

DETAILED DESCRIPTION 

ISOLATED PROTEINS AND POLYNUCLEOTIDES 

Nucleotide and amino acid sequences are reported below 
for each clone and protein disclosed in the present application. 
In some instances the sequences are preliminary and 
may include some incorrect or ambiguous bases or amino 
acids. The actual nucleotide sequence of each clone can 
readily be determined by sequencing of the deposited clone 
in accordance with known methods. The predicted amino 
acid sequence (both full length and mature) can then be 
determined from such nucleotide sequence. The amino acid 
sequence of the protein encoded by a particular clone can 
also be determined by expression of the clone in a suitable 
host cell, collecting the protein and determining its 
sequence. 

For each disclosed protein applicants have identified what 
they have determined to be the reading frame best identifiable 
with sequence information available at the time of 
filing. Because of the partial ambiguity in reported sequence 
information, reported protein sequences include "Xaa" 
designators. These "Xaa" designators indicate either (1) a 
residue which cannot be identified because of nucleotide 
sequence ambiguity or (2) a stop codon in the determined 
nucleotide sequence where applicants believe one should not 
exist (if the nucleotide sequence were determined more 
accurately). 

As used herein a "secreted" protein is one which, when 
expressed in a suitable host cell, is transported across or 
through a membrane, including transport as a result of signal 
sequences in its amino acid sequence. "Secreted" proteins 
include without limitation proteins secreted wholly (e.g., 
soluble proteins) or partially (e.g., receptors) from the cell in 
which they are expressed. "Secreted" proteins also include 
without limitation proteins which are transported across the 
membrane of the endoplasmic reticulum. 

Clone "BD372_5" 

A polynucleotide of the present invention has been identified 
as clone "BD372_5". BD372_5 was isolated from a 
human fetal kidney cDNA library using methods which are 
selective for cDNAs encoding secreted proteins. BD372_5 
is a full-length clone, including the entire coding sequence 
of a secreted protein (also referred to herein as "BD372_5 
protein"). 

The nucleotide sequence of the 5' portion of BD372_5 as 
presently determined is reported in SEQ ID NO:1. What 
applicants presently believe to be the proper reading frame for 
the coding region is indicated in SEQ ID NO:2. The predicted 
amino acid sequence of the BD372_5 protein corresponding 
to the foregoing nucleotide sequence is reported in SEQ ID 
NO:2. Amino acids 1 to 27 are the predicted leader/signal 
sequence, with the predicted mature amino acid sequence 
beginning at amino acid 28. Additional nucleotide sequence 
from the 3' portion of BD372_5, including the polyA tail, is 
reported in SEQ ID NO:3. 

The EcoRI/NotI restriction fragment obtainable from the 
deposit containing clone BD372_5 should be approximately 
2300 bp. 

The nucleotide sequence disclosed herein for BD372_5 
was searched against the GenBank database using BLASTA/ 
BLASTX and FASTA search protocols. BD372_5 demonstrated 
at least some identity with ESTs identified as 
"yc90f12.s1 Homo sapiens cDNA clone 23278 3'" (R39276, 
BlastN) and "EST05537 Homo sapiens cDNA clone 
HFBEM26" (T07647, Fasta). Based upon identity, 
BD372_5 proteins and each identical protein or peptide 
may share at least some activity. 

Clone "BR533_4" 

A polynucleotide of the present invention has been identified 
as clone "BR533_4". BR533_4 was isolated from a 
human fetal kidney cDNA library using methods which are 
selective for cDNAs encoding secreted proteins. BR533_4 
is a full-length clone, including the entire coding sequence 
of a secreted protein (also referred to herein as "BR533_4 
protein"). 

The nucleotide sequence of the 5' portion of BR533_4 as 
presently determined is reported in SEQ ID NO:4. What 
applicants presently believe to be the proper reading frame and 
the predicted amino acid sequence of the BR533_4 protein 
corresponding to the foregoing nucleotide sequence are 
reported in SEQ ID NO:5. 
Additional nucleotide sequence from the 3' portion of 
BR533_4, including the polyA tail, is reported in SEQ ID 
NO:6. 

The EcoRI/NotI restriction fragment obtainable from the 
deposit containing clone BR533_4 should be approximately 
2850 bp. 

The nucleotide sequence disclosed herein for BR533_4 
was searched against the GenBank database using BLASTA/ 
BLASTX and FASTA search protocols. BR533_4 demonstrated 
at least some homology with murine semaphorin E 
(X85994, BlastN). BR533_4 also shows at least some 
identity with an EST identified as "yy80d10.s1 Homo 
sapiens cDNA clone 279859 3'" (N38844, BlastN). Based 
upon homology, BR533_4 proteins and each homologous 
protein or peptide may share at least some activity. 

Clone "CC288_9" 

A polynucleotide of the present invention has been identified 
as clone "CC288_9". CC288_9 was isolated from a 
human adult [...] cDNA library using methods which are 
selective for cDNAs encoding secreted proteins. CC288_9 
is a full-length clone, including the entire coding sequence 
of a secreted protein (also referred to herein as "CC288_9 
protein"). 

The nucleotide sequence of CC288_9 as presently determined 
is reported in SEQ ID NO:7. What applicants presently 
believe to be the proper reading frame and the predicted 
amino acid sequence of the CC288_9 protein 
corresponding to the foregoing nucleotide sequence are 
reported in SEQ ID NO:8. 

The nucleotide sequence disclosed herein for CC288_9 
was searched against the GenBank database using BLASTA/ 
BLASTX and FASTA search protocols. No hits were found 
in the database. 

Clones BD372_5, BR533_4 and CC288_9 were deposited 
with the American Type Culture 
Collection under accession number ATCC 98146, from 
which each clone comprising a particular polynucleotide is 
obtainable. Each clone has been transfected into separate 
bacterial cells (E. coli) in this composite deposit. Each clone 
can be removed from the vector in which it was deposited by 
performing an EcoRI/NotI digestion (5' site, EcoRI; 3' site, 
NotI) to produce the appropriately sized fragment for such 
clone (approximate clone size fragment are identified 
below). 

Bacterial cells containing a particular clone can be 
obtained from the composite deposit as follows: 

An oligonucleotide probe or probes should be designed to 
the sequence that is known for that particular clone. This 
sequence can be derived from the sequences provided 
herein, or from a combination of those sequences. The 
sequence of the oligonucleotide probe that was used to 
isolate each full-length clone is identified below, and should 
be most reliable in isolating the clone of interest. 

Clone      Probe Sequence 
BD372_5    SEQ ID NO:9 
BR533_4    SEQ ID NO:10 
CC288_9    SEQ ID NO:11 

In the sequences listed above which include an N at position 
2, that position is occupied in preferred probes/primers by a 
biotinylated phosphoramidite residue rather than a nucleotide 
(such as, for example, that produced by use of biotin 
phosphoramidite [...]). 

The design of the oligonucleotide probe should preferably 
follow these parameters: 

(a) It should be designed to an area of the sequence which 
has the fewest ambiguous bases ("N's"), if any. 

(b) It should be designed to have a Tm of approx. 80° C. 
(assuming 2° for each A or T and 4 degrees for each G 
or C). 

The oligonucleotide should preferably be labeled with g-32P 
ATP (specific activity 6000 Ci/mmole) and T4 polynucleotide 
kinase using commonly employed techniques for 
labeling oligonucleotides. Other labeling techniques can 
also be used. Unincorporated label should preferably be 
removed by gel filtration chromatography or other established 
methods. The amount of radioactivity incorporated 
into the probe should be quantitated by measurement in a 
scintillation counter. Preferably, specific activity of the 
resulting probe should be approximately 4e+6 dpm/pmole. 

The bacterial culture containing the pool of full-length 
clones should preferably be thawed and 100 ul of the stock 
used to inoculate a sterile culture flask containing 25 ml of 
sterile L-broth containing ampicillin at 100 ug/ml. The 
culture should preferably be grown to saturation at 37° C., 
and the saturated culture should preferably be diluted in 
fresh L-broth. Aliquots of these dilutions should preferably 
be plated to determine the dilution and volume which will 
yield approximately 5000 distinct and well-separated colonies 
on solid bacteriological media containing L-broth containing 
ampicillin at 100 ug/ml and agar at 1.5% in a 150 
mm petri dish when grown overnight at 37° C. Other known 
methods of obtaining distinct, well-separated colonies can 
also be employed. 

Standard colony hybridization procedures should then be 
used to transfer the colonies to nitrocellulose filters and lyse, 
denature and bake them. 

The filter is then preferably incubated [...] 
with gentle agitation in 6X SSC (20X stock is 175.3 g 
NaCl/liter, 88.2 g Na citrate/liter, adjusted to pH 7.0 with 
NaOH) containing 0.5% SDS, 100 ug/ml of yeast RNA, and 
10 mM EDTA (approximately 10 mL per 150 mm filter). 
Preferably, the probe is then added to the hybridization mix 
at a concentration greater than or equal to 1e+6 dpm/mL. 
The filter is then preferably incubated at 65° C. with gentle 
agitation overnight. The filter is then preferably washed in 
500 mL of 2X SSC/0.5% SDS at room temperature without 
agitation, preferably followed by 500 mL of 2X SSC/0.1% 
SDS at room temperature with gentle shaking for 15 minutes. 
A third wash with 0.1X SSC/0.5% SDS at 65° C. for 30 
minutes to 1 hour is optional. The filter is then preferably 
dried and subjected to autoradiography for sufficient time to 
visualize the positives on the X-ray film. Other known 
hybridization methods can also be employed. 

The positive colonies are picked, grown in culture, and 
plasmid DNA isolated using standard procedures. The 
clones can then be verified by restriction analysis, hybridization 
analysis, or DNA sequencing. 

Fragments of the proteins of the present invention which 
are capable of exhibiting biological activity are also encompassed 
by the present invention. Fragments of the protein 
may be in linear form or they may be cyclized using known 
methods, for example, as described in H. U. Saragovi, et al., 
Bio/Technology 10, 773-778 (1992) and in R. S. McDowell, 
et al., J. Amer. Chem. Soc. 114, 9245-9253 (1992), both of 
which are incorporated herein by reference. Such fragments 
may be fused to carrier molecules such as immunoglobulins 
for many purposes, including increasing the valency of 
protein binding sites. For example, fragments of the protein 
may be fused through "linker" sequences to the Fc portion 
of an immunoglobulin. For a bivalent form of the protein, 
such a fusion could be to the Fc portion of an IgG molecule. 
Other immunoglobulin isotypes may also be used to generate 
such fusions. For example, a protein-IgM fusion would 
generate a decavalent form of the protein of the invention. 

The present invention also provides both full-length and 
mature forms of the disclosed proteins. The full-length form 
of the such proteins is identified in the sequence listing by 
translation of the nucleotide sequence of each disclosed 
clone. The mature form of such protein may be obtained by 
expression of the disclosed full-length polynucleotide 
(preferably those deposited with ATCC) in a suitable mammalian 
cell or other host cell. The sequence of the mature 
form of the protein may also be determinable from the amino 
acid sequence of the full-length form. 

The present invention also provides genes corresponding 
to the cDNA sequences disclosed herein. The corresponding 
genes can be isolated in accordance with known methods 
using the sequence information disclosed herein. Such methods 
include the preparation of probes or primers from the 
disclosed sequence information for identification and/or 
amplification of genes in appropriate genomic libraries or 
other sources of genomic materials. 

Where the protein of the present invention is membrane-bound 
(e.g., is a receptor), the present invention also provides 
for soluble forms of such protein. In such forms part 
or all of the intracellular and transmembrane domains of the 
protein are deleted such that the protein is fully secreted 
from the cell in which it is expressed. The intracellular and 
transmembrane domains of proteins of the invention can be 
identified in accordance with known techniques for determination 
of such domains from sequence information. 

Species homologs of the disclosed polynucleotides and 
proteins are also provided by the present invention. Species 
homologs may be isolated and identified by making suitable 
probes or primers from the sequences provided herein and 
screening a suitable nucleic acid source from the desired 
species. 

The invention also encompasses allelic variants of the 
disclosed polynucleotides or proteins; that is, naturally-occurring 
alternative forms of the isolated polynucleotide 
which also encode proteins which are identical, homologous 
or related to that encoded by the polynucleotides. 

The isolated polynucleotide of the invention may be 
operably linked to an expression control sequence such as 
the pMT2 or pED expression vectors disclosed in Kaufman 
et al., Nucleic Acids Res. 19, 4485-4490 (1991), in order to 
produce the protein recombinantly. Many suitable expression 
control sequences are known in the art. General methods 
of expressing recombinant proteins are also known and 
are exemplified in R. Kaufman, Methods in Enzymology 
185, 537-566 (1990). As defined herein "operably linked" 
means that the isolated polynucleotide of the invention and 
an expression control sequence are situated within a vector 
or cell in such a way that the protein is expressed by a host 
cell which has been transformed (transfected) with the 
ligated polynucleotide/expression control sequence. 

A number of types of cells may act as suitable host cells 
for expression of the protein. Mammalian host cells include, 
for example, monkey COS cells, Chinese Hamster Ovary 
(CHO) cells, human kidney 293 cells, human epidermal 
A431 cells, human Colo205 cells, 3T3 cells, CV-1 cells, 
other transformed primate cell lines, normal diploid cells, 
cell strains derived from in vitro culture of primary tissue, 
primary explants, HeLa cells, mouse L cells, BHK, HL-60, 
U937, HaK or Jurkat cells. 

Alternatively, it may be possible to produce the protein in 
lower eukaryotes such as yeast or in prokaryotes such as 
bacteria. Potentially suitable yeast strains include Saccharomyces 
cerevisiae, Schizosaccharomyces pombe, 
Kluyveromyces strains, Candida, or any yeast strain capable 
of expressing heterologous proteins. Potentially suitable 
bacterial strains include Escherichia coli, Bacillus subtilis, 
or any bacterial strain capable of 
expressing heterologous proteins. If the protein is made in 
yeast or bacteria, it may be necessary to modify the protein 
produced therein, for example by phosphorylation or glycosylation of the appropriate sites, in
order to obtain the functional protein. Such covalent attachments may be accomplished using known
chemical or enzymatic methods.

The protein may also be produced by operably linking the isolated polynucleotide of the invention
to suitable control sequences in one or more insect expression vectors, and employing an insect
expression system. Materials and methods for baculovirus/insect cell expression systems are
commercially available in kit form from, e.g., Invitrogen, San Diego, Calif., U.S.A. (the MaxBac®
kit), and such methods are well known in the art, as described in Summers and Smith, Texas
Agricultural Experiment Station Bulletin No. 1555 (1987), incorporated herein by reference. As
used herein, an insect cell capable of expressing a polynucleotide of the present invention is
"transformed."

The protein of the invention may be prepared by culturing transformed host cells under culture
conditions suitable to express the recombinant protein. The resulting expressed protein may then
be purified from such culture (i.e., from culture medium or cell extracts) using known
purification processes, such as gel filtration and ion exchange chromatography. The purification
of the protein may also include an affinity column containing agents which will bind to the
protein; one or more column steps over such affinity resins as concanavalin A-agarose,
heparin-toyopearl® or Cibacrom blue 3GA Sepharose®; one or more steps involving hydrophobic
interaction chromatography using such resins as phenyl ether, butyl ether, or propyl ether; or
immunoaffinity chromatography.

Alternatively, the protein of the invention may also be expressed in a form which will facilitate
purification. For example, it may be expressed as a fusion protein, such as those of maltose
binding protein (MBP), glutathione-S-transferase (GST) or thioredoxin (TRX). Kits for expression
and purification of such fusion proteins are commercially available from New England BioLab
(Beverly, Mass.), Pharmacia (Piscataway, N.J.) and InVitrogen, respectively. The protein can also
be tagged with an epitope and subsequently purified by using a specific antibody directed to such
epitope. One such epitope ("Flag") is commercially available from Kodak (New Haven, Conn.).

Finally, one or more reverse-phase high performance liquid chromatography (RP-HPLC) steps
employing hydrophobic RP-HPLC media, e.g., silica gel having pendant methyl or other aliphatic
groups, can be employed to further purify the protein. Some or all of the foregoing purification
steps, in various combinations, can also be employed to provide a substantially homogeneous
isolated recombinant protein. The protein thus purified is substantially free of other mammalian
proteins and is defined in accordance with the present invention as an "isolated protein."

The protein of the invention may also be expressed as a product of transgenic animals, e.g., as a
component of the milk of transgenic cows, goats, pigs, or sheep which are characterized by
somatic or germ cells containing a nucleotide sequence encoding the protein.

The protein may also be produced by known conventional chemical synthesis. Methods for
constructing the proteins of the present invention by synthetic means are known to those skilled
in the art. The synthetically-constructed protein sequences, by virtue of sharing primary,
secondary or tertiary structural and/or conformational characteristics with proteins may possess
biological properties in common therewith, including protein activity. Thus, they may be employed
as biologically active or immunological substitutes for natural, purified proteins in screening
of therapeutic compounds and in immunological processes for the development of antibodies.

The proteins provided herein also include proteins characterized by amino acid sequences similar
to those of purified proteins but into which modifications are naturally provided or deliberately
engineered. For example, modifications in the peptide or DNA sequences can be made by those
skilled in the art using known techniques. Modifications of interest in the protein sequences may
include the alteration, substitution, replacement, insertion or deletion of a selected amino acid
residue in the coding sequence. For example, one or more of the cysteine residues may be deleted
or replaced with another amino acid to alter the conformation of the molecule. Techniques for
such alteration, substitution, replacement, insertion or deletion are well known to those skilled
in the art (see, e.g., U.S. Pat. No. 4,518,584). Preferably, such alteration, substitution,
replacement, insertion or deletion retains the desired activity of the protein.

Other fragments and derivatives of the sequences of proteins which would be expected to retain
protein activity in whole or in part and may thus be useful for screening or other immunological
methodologies may also be easily made by those skilled in the art given the disclosures herein.
Such modifications are believed to be encompassed by the present invention.

USES AND BIOLOGICAL ACTIVITY

The polynucleotides and proteins of the present invention are expected to exhibit one or more of
the uses or biological activities (including those associated with assays cited herein)
identified below. Uses or activities described for proteins of the present invention may be
provided by administration or use of such proteins or by administration or use of polynucleotides
encoding such proteins (such as, for example, in gene therapies or vectors suitable for
introduction of DNA).

Research Uses and Utilities

The polynucleotides provided by the present invention can be used by the research community for
various purposes. The polynucleotides can be used to express recombinant protein for analysis,
characterization or therapeutic use; as markers for tissues in which the corresponding protein is
preferentially expressed (either constitutively or at a particular stage of tissue
differentiation or development or in disease states); as molecular weight markers on Southern
gels; as chromosome markers or tags (when labeled) to identify chromosomes or to map related gene
positions; to compare with endogenous DNA sequences in patients to identify potential genetic
disorders; as probes to hybridize and thus discover novel, related DNA sequences; as a source of
information to derive PCR primers for genetic fingerprinting; as a probe to "subtract-out" known
sequences in the process of discovering other novel polynucleotides; for selecting and making
oligomers for attachment to a "gene chip" or other support, including for examination of
expression patterns; to raise anti-protein antibodies using DNA immunization techniques; and as
an antigen to raise anti-DNA antibodies or elicit another immune response. Where the
polynucleotide encodes a protein which binds or potentially binds to another protein (such as,
for example, in a receptor-ligand interaction), the polynucleotide can also be used in
interaction trap assays (such as, for example, that described in Gyuris et al., Cell 75:791-803
(1993)) to identify polynucleotides encoding the other protein with which binding occurs or to
identify inhibitors of the binding interaction.
The proteins provided by the present invention can similarly be used in assay to determine
biological activity, including in a panel of multiple proteins for high-throughput screening; to
raise antibodies or to elicit another immune response; as a reagent (including the labeled
reagent) in assays designed to quantitatively determine levels of the protein (or its receptor)
in biological fluids; as markers for tissues in which the corresponding protein is preferentially
expressed (either constitutively or at a particular stage of tissue differentiation or
development or in a disease state); and, of course, to isolate correlative receptors or ligands.
Where the protein binds or potentially binds to another protein (such as, for example, in a
receptor-ligand interaction), the protein can be used to identify the other protein with which
binding occurs or to identify inhibitors of the binding interaction. Proteins involved in these
binding interactions can also be used to screen for peptide or small molecule inhibitors or
agonists of the binding interaction.

Any or all of these research utilities are capable of being developed into reagent grade or kit
format for commercialization as research products.

Methods for performing the uses listed above are well known to those skilled in the art.
References disclosing such methods include without limitation "Molecular Cloning: A Laboratory
Manual", 2d ed., Cold Spring Harbor Laboratory Press, Sambrook, J., E. F. Fritsch and T. Maniatis
eds., 1989, and "Methods in Enzymology: Guide to Molecular Cloning Techniques", Academic Press,
Berger, S. L. and A. R. Kimmel eds., 1987.

Nutritional Uses

Polynucleotides and proteins of the present invention can also be used as nutritional sources or
supplements. Such uses include without limitation use as a protein or amino acid supplement, use
as a carbon source, use as a nitrogen source and use as a source of carbohydrate. In such cases
the protein or polynucleotide of the invention can be added to the feed of a particular organism
or can be administered as a separate solid or liquid preparation, such as in the form of powder,
pills, solutions, suspensions or capsules. In the case of microorganisms, the protein or
polynucleotide of the invention can be added to the medium in or on which the microorganism is
cultured.

Cytokine and Cell Proliferation/Differentiation Activity

A protein of the present invention may exhibit cytokine, cell proliferation (either inducing or
inhibiting) or cell differentiation (either inducing or inhibiting) activity or may induce
production of other cytokines in certain cell populations. Many protein factors discovered to
date, including all known cytokines, have exhibited activity in one or more factor dependent cell
proliferation assays, and hence the assays serve as a convenient confirmation of cytokine
activity. The activity of a protein of the present invention is evidenced by any one of a number
of routine factor dependent cell proliferation assays for cell lines including, without
limitation, 32D, DA2, DA1G, T10, B9, B9/11, BaF3, MC9/G, M+ (preB M+), 2E8, RB5, DA1, 123, T1165,
HT2, CTLL2, TF-1, Mo7e and CMK.

The activity of a protein of the invention may, among other means, be measured by the following
methods:

Assays for T-cell or thymocyte proliferation include without limitation those described in:
Current Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M.
Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 3, In
Vitro assays for Mouse Lymphocyte Function 3.1-3.19; Chapter 7, Immunologic studies in Humans);
Takai et al., J. Immunol. 137:3494-3500, 1986; Bertagnolli et al., J. Immunol. 145:1706-1712,
1990; Bertagnolli et al., Cellular Immunology 133:327-341, 1991; Bertagnolli, et al., J. Immunol.
149:3778-3783, 1992; Bowman et al., J. Immunol. 152:1756-1761, 1994.

Assays for cytokine production and/or proliferation of spleen cells, lymph node cells or
thymocytes include, without limitation, those described in: Polyclonal T cell stimulation,
Kruisbeek, A. M. and Shevach, E. M. In Current Protocols in Immunology. J. E. e.a. Coligan eds.
Vol 1 pp. 3.12.1-3.12.14, John Wiley and Sons, Toronto. 1994; and Measurement of mouse and human
interleukin γ, Schreiber, R. D. In Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol 1
pp. 6.8.1-6.8.8, John Wiley and Sons, Toronto. 1994.

Assays for proliferation and differentiation of hematopoietic and lymphopoietic cells include,
without limitation, those described in: Measurement of Human and Murine Interleukin 2 and
Interleukin 4, Bottomly, K., Davis, L. S. and Lipsky, P. E. In Current Protocols in Immunology.
J. E. e.a. Coligan eds. Vol 1 pp. 6.3.1-6.3.12, John Wiley and Sons, Toronto. 1991; deVries et
al., J. Exp. Med. 173:1205-1211, 1991; Moreau et al., Nature 336:690-692, 1988; Greenberger et
al., Proc. Natl. Acad. Sci. U.S.A. 80:2931-2938, 1983; Measurement of mouse and human interleukin
6, Nordan, R. In Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol 1 pp. 6.6.1-6.6.5,
John Wiley and Sons, Toronto. 1991; Smith et al., Proc. Natl. Acad. Sci. U.S.A. 83:1857-1861,
1986; Measurement of human Interleukin 11, Bennett, F., Giannotti, J., Clark, S. C. and Turner,
K. J. In Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol 1 pp. 6.15.1, John Wiley
and Sons, Toronto. 1991; Measurement of mouse and human Interleukin 9, Ciarletta, A., Giannotti,
J., Clark, S. C. and Turner, K. J. In Current Protocols in Immunology. J. E. e.a. Coligan eds.
Vol 1 pp. 6.13.1, John Wiley and Sons, Toronto. 1991.

Assays for T-cell clone responses to antigens (which will identify, among others, proteins that
affect APC-T cell interactions as well as direct T-cell effects by measuring proliferation and
cytokine production) include, without limitation, those described in: Current Protocols in
Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober,
Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 3, In Vitro assays for Mouse
Lymphocyte Function; Chapter 6, Cytokines and their cellular receptors; Chapter 7, Immunologic
studies in Humans); Weinberger et al., Proc. Natl. Acad. Sci. USA 77:6091-6095, 1980; Weinberger
et al., Eur. J. Immun. 11:405-411, 1981; Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et
al., J. Immunol. 140:508-512, 1988.

Immune Stimulating or Suppressing Activity

A protein of the present invention may also exhibit immune stimulating or immune suppressing
activity, including without limitation the activities for which assays are described herein. A
protein may be useful in the treatment of various immune deficiencies and disorders (including
severe combined immunodeficiency (SCID)), e.g., in regulating (up or down) growth and
proliferation of T and/or B lymphocytes, as well as effecting the cytolytic activity of NK cells
and other cell populations. These immune deficiencies may be genetic or be caused by viral (e.g.,
HIV) as well as bacterial or fungal infections, or may result from autoimmune disorders. More
specifically, infectious diseases caused by viral, bacterial, fungal or other infection may be
treatable using a protein of the present invention, including infections by HIV, hepatitis
viruses, herpesviruses, mycobacteria, Leishmania spp., malaria spp. and various fungal infections
such as candidiasis. Of course, in this
regard, a protein of the present invention may also be useful where a boost to the immune system
generally may be desirable, i.e., in the treatment of cancer.

Autoimmune disorders which may be treated using a protein of the present invention include, for
example, connective tissue disease, multiple sclerosis, systemic lupus erythematosus, rheumatoid
arthritis, autoimmune pulmonary inflammation, Guillain-Barre syndrome, autoimmune thyroiditis,
insulin dependent diabetes mellitus, myasthenia gravis, graft-versus-host disease and autoimmune
inflammatory eye disease. Such a protein of the present invention may also be useful in the
treatment of allergic reactions and conditions such as asthma (particularly allergic asthma) or
other respiratory problems. Other conditions, in which immune suppression is desired (including,
for example, organ transplantation), may also be treatable using a protein of the present
invention.

Using the proteins of the invention it may also be possible to regulate immune responses, in a
number of ways. Down regulation may be in the form of inhibiting or blocking an immune response
already in progress or may involve preventing the induction of an immune response. The functions
of activated T cells may be inhibited by suppressing T cell responses or by inducing specific
tolerance in T cells, or both. Immunosuppression of T cell responses is generally an active,
non-antigen-specific, process which requires continuous exposure of the T cells to the
suppressive agent. Tolerance, which involves inducing non-responsiveness or anergy in T cells, is
distinguishable from immunosuppression in that it is generally antigen-specific and persists
after exposure to the tolerizing agent has ceased. Operationally, tolerance can be demonstrated
by the lack of a T cell response upon reexposure to specific antigen in the absence of the
tolerizing agent.

Down regulating or preventing one or more antigen functions (including without limitation B
lymphocyte antigen functions (such as, for example, B7)), e.g., preventing high level lymphokine
synthesis by activated T cells, will be useful in situations of tissue, skin and organ
transplantation and in graft-versus-host disease (GVHD). For example, blockage of T cell function
should result in reduced tissue destruction in tissue transplantation. Typically, in tissue
transplants, rejection of the transplant is initiated through its recognition as foreign by T
cells, followed by an immune reaction that destroys the transplant. The administration of a
molecule which inhibits or blocks interaction of a B7 lymphocyte antigen with its natural
ligand(s) on immune cells (such as a soluble, monomeric form of a peptide having B7-2 activity
alone or in conjunction with a monomeric form of a peptide having an activity of another B
lymphocyte antigen (e.g., B7-1, B7-3) or blocking antibody), prior to transplantation can lead to
the binding of the molecule to the natural ligand(s) on the immune cells without transmitting the
corresponding costimulatory signal. Blocking B lymphocyte antigen function in this manner
prevents cytokine synthesis by immune cells, such as T cells, and thus acts as an
immunosuppressant. Moreover, the lack of costimulation may also be sufficient to anergize the T
cells, thereby inducing tolerance in a subject. Induction of long-term tolerance by B lymphocyte
antigen-blocking reagents may avoid the necessity of repeated administration of these blocking
reagents. To achieve sufficient immunosuppression or tolerance in a subject, it may also be
necessary to block the function of a combination of B lymphocyte antigens.

The efficacy of particular blocking reagents in preventing organ transplant rejection or GVHD can
be assessed using animal models that are predictive of efficacy in humans. Examples of
appropriate systems which can be used include allogeneic cardiac grafts in rats and xenogeneic
pancreatic islet cell grafts in mice, both of which have been used to examine the
immunosuppressive effects of CTLA4Ig fusion proteins in vivo as described in Lenschow et al.,
Science 257:789-792 (1992) and Turka et al., Proc. Natl. Acad. Sci. USA, 89:11102-11105 (1992).
In addition, murine models of GVHD (see Paul ed., Fundamental Immunology, Raven Press, New York,
1989, pp. 846-847) can be used to determine the effect of blocking B lymphocyte antigen function
in vivo on the development of that disease.

Blocking antigen function may also be therapeutically useful for treating autoimmune diseases.
Many autoimmune disorders are the result of inappropriate activation of T cells that are reactive
against self tissue and which promote the production of cytokines and autoantibodies involved in
the pathology of the diseases. Preventing the activation of autoreactive T cells may reduce or
eliminate disease symptoms. Administration of reagents which block costimulation of T cells by
disrupting receptor:ligand interactions of B lymphocyte antigens can be used to inhibit T cell
activation and prevent production of autoantibodies or T cell-derived cytokines which may be
involved in the disease process. Additionally, blocking reagents may induce antigen-specific
tolerance of autoreactive T cells which could lead to long-term relief from the disease. The
efficacy of blocking reagents in preventing or alleviating autoimmune disorders can be determined
using a number of well-characterized animal models of human autoimmune diseases. Examples include
murine experimental autoimmune encephalitis, systemic lupus erythematosus in MRL/lpr/lpr mice or
NZB hybrid mice, murine autoimmune collagen arthritis, diabetes mellitus in NOD mice and BB rats,
and murine experimental myasthenia gravis (see Paul ed., Fundamental Immunology, Raven Press, New
York, 1989, pp. 840-856).

Upregulation of an antigen function (preferably a B lymphocyte antigen function), as a means of
up regulating immune responses, may also be useful in therapy. Upregulation of immune responses
may be in the form of enhancing an existing immune response or eliciting an initial immune
response. For example, enhancing an immune response through stimulating B lymphocyte antigen
function may be useful in cases of viral infection. In addition, systemic viral diseases such as
influenza, the common cold, and encephalitis might be alleviated by the administration of
stimulatory forms of B lymphocyte antigens systemically.

Alternatively, anti-viral immune responses may be enhanced in an infected patient by removing T
cells from the patient, costimulating the T cells in vitro with viral antigen-pulsed APCs either
expressing a peptide of the present invention or together with a stimulatory form of a soluble
peptide of the present invention and reintroducing the in vitro activated T cells into the
patient. Another method of enhancing anti-viral immune responses would be to isolate infected
cells from a patient, transfect them with a nucleic acid encoding a protein of the present
invention as described herein such that the cells express all or a portion of the protein on
their surface, and reintroduce the transfected cells into the patient. The infected cells would
now be capable of delivering a costimulatory signal to, and thereby activate, T cells in vivo.

In another application, up regulation or enhancement of antigen function (preferably B lymphocyte
antigen function) may be useful in the induction of tumor immunity. Tumor cells (e.g., sarcoma,
melanoma, lymphoma, leukemia, neuroblastoma, carcinoma) transfected with a nucleic acid encoding
at least one peptide of the present invention can be administered to a subject to overcome
tumor-specific tolerance in the subject. If desired, the tumor cell can be transfected to express
a combination of peptides. For example, tumor cells obtained from a patient can be transfected ex
vivo with an expression vector directing the expression of a peptide having B7-2-like activity
alone, or in conjunction with a peptide having B7-1-like activity and/or B7-3-like activity. The
transfected tumor cells are returned to the patient to result in expression of the peptides on
the surface of the transfected cell. Alternatively, gene therapy techniques can be used to target
a tumor cell for transfection in vivo.

The presence of the peptide of the present invention having the activity of a B lymphocyte
antigen(s) on the surface of the tumor cell provides the necessary costimulation signal to T
cells to induce a T cell mediated immune response against the transfected tumor cells. In
addition, tumor cells which lack MHC class I or MHC class II molecules, or which fail to
reexpress sufficient amounts of MHC class I or MHC class II molecules, can be transfected with
nucleic acid encoding all or a portion of (e.g., a cytoplasmic-domain truncated portion) of an
MHC class I α chain protein and β2 microglobulin protein or an MHC class II α chain protein and
an MHC class II β chain protein to thereby express MHC class I or MHC class II proteins on the
cell surface. Expression of the appropriate class I or class II MHC in conjunction with a peptide
having the activity of a B lymphocyte antigen (e.g., B7-1, B7-2, B7-3) induces a T cell mediated
immune response against the transfected tumor cell. Optionally, a gene encoding an antisense
construct which blocks expression of an MHC class II associated protein, such as the invariant
chain, can also be cotransfected with a DNA encoding a peptide having the activity of a B
lymphocyte antigen to promote presentation of tumor associated antigens and induce tumor specific
immunity. Thus, the induction of a T cell mediated immune response in a human subject may be
sufficient to overcome tumor-specific tolerance in the subject.

The activity of a protein of the invention may, among other means, be measured by the following
methods:

Suitable assays for thymocyte or splenocyte cytotoxicity include, without limitation, those
described in: Current Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H.
Margulies, E. M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience
(Chapter 3, In Vitro assays for Mouse Lymphocyte Function 3.1-3.19; Chapter 7, Immunologic
studies in Humans); Herrmann et al., Proc. Natl. Acad. Sci. USA 78:2488-2492, 1981; Herrmann et
al., J. Immunol. 128:1968-1974, 1982; Handa et al., J. Immunol. 135:1564-1572, 1985; Takai et
al., J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988; Bowman et al.,
J. Virology 61:1992-1998; Bertagnolli et al., Cellular Immunology 133:327-341, 1991; Brown et
al., J. Immunol. 153:3079-3092, 1994.

Assays for T-cell-dependent immunoglobulin responses and isotype switching (which will identify,
among others, proteins that modulate T-cell dependent antibody responses and that affect Th1/Th2
profiles) include, without limitation, those described in: Maliszewski, J. Immunol.
144:3028-3033, 1990; and Assays for B cell function: In vitro antibody production, Mond, J. J.
and Brunswick, M. In Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol 1 pp.
3.8.1-3.8.16, John Wiley and Sons, Toronto. 1994.

Mixed lymphocyte reaction (MLR) assays (which will identify, among others, proteins that generate
predominantly Th1 and CTL responses) include, without limitation, those described in: Current
Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach,
W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 3, In Vitro assays
for Mouse Lymphocyte Function 3.1-3.19; Chapter 7, Immunologic studies in Humans); Takai et al.,
J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988; Bertagnolli et al.,
J. Immunol. 149:3778-3783, 1992.

Dendritic cell-dependent assays (which will identify, among others, proteins expressed by
dendritic cells that activate naive T-cells) include, without limitation, those described in:
Guery et al., J. Immunol. 134:536-544, 1995; Inaba et al., Journal of Experimental Medicine
173:549-559, 1991; Macatonia et al., Journal of Immunology 154:5071-5079, 1995; Porgador et al.,
Journal of Experimental Medicine 182:255-260, 1995; Nair et al., Journal of Virology
67:4062-4069, 1993; Huang et al., Science 264:961-965, 1994; Macatonia et al., Journal of
Experimental Medicine 169:1255-1264, 1989; Bhardwaj et al., Journal of Clinical Investigation
94:797-807, 1994; and Inaba et al., Journal of Experimental Medicine 172:631-640, 1990.

Assays for lymphocyte survival/apoptosis (which will identify, among others, proteins that
prevent apoptosis after superantigen induction and proteins that regulate lymphocyte homeostasis)
include, without limitation, those described in: Darzynkiewicz et al., Cytometry 13:795-808,
1992; Gorczyca et al., Leukemia 7:659-670, 1993; Gorczyca et al., Cancer Research 53:1945-1951,
1993; Itoh et al., Cell 66:233-243, 1991; Zacharchuk, Journal of Immunology 145:4037-4045, 1990;
Zamai et al., Cytometry 14:891-897, 1993; Gorczyca et al., International Journal of Oncology
1:639-648, 1992.

Assays for proteins that influence early steps of T-cell commitment and development include,
without limitation, those described in: Antica et al., Blood 84:111-117, 1994; Fine et al.,
Cellular Immunology 155:111-122, 1994; Galy et al., Blood 85:2770-2778, 1995; Toki et al., Proc.
Nat. Acad. Sci. USA 88:7548-7551, 1991.

Hematopoiesis Regulating Activity

A protein of the present invention may be useful in regulation of hematopoiesis and,
consequently, in the treatment of myeloid or lymphoid cell deficiencies. Even marginal biological
activity in support of colony forming cells or of factor-dependent cell lines indicates
involvement in regulating hematopoiesis, e.g. in supporting the growth and proliferation of
erythroid progenitor cells alone or in combination with other cytokines, thereby indicating
utility, for example, in treating various anemias or for use in conjunction with
irradiation/chemotherapy to stimulate the production of erythroid precursors and/or erythroid
cells; in supporting the growth and proliferation of myeloid cells such as granulocytes and
monocytes/macrophages (i.e., traditional CSF activity) useful, for example, in conjunction with
chemotherapy to prevent or treat consequent myelo-suppression; in supporting the growth and
proliferation of megakaryocytes and consequently of platelets thereby allowing prevention or
treatment of various platelet disorders such as thrombocytopenia, and generally for use in place
of or complementary to platelet transfusions; and/or in supporting the growth and proliferation
of hematopoietic stem cells which are capable of maturing to any and all of the above-mentioned
hematopoietic cells and therefore find therapeutic utility in various stem cell disorders (such as
those usually treated with transplantation, including, without limitation, aplastic anemia and paroxysmal nocturnal hemoglobinuria), as well as in repopulating the stem cell compartment post irradiation/chemotherapy, either in-vivo or ex-vivo (i.e., in conjunction with bone marrow transplantation or with peripheral progenitor cell transplantation (homologous or heterologous)) as normal cells or genetically manipulated for gene therapy.

The activity of a protein of the invention may, among other means, be measured by the following methods:

Suitable assays for proliferation and differentiation of various hematopoietic lines are cited above.

Assays for embryonic stem cell differentiation (which will identify, among others, proteins that influence embryonic differentiation hematopoiesis) include, without limitation, those described in: Johansson et al. Cellular Biology 15:141-151, 1995; Keller et al., Molecular and Cellular Biology 13:473-486, 1993; McClanahan et al., Blood 81:2903-2915, 1993.

Assays for stem cell survival and differentiation (which will identify, among others, proteins that regulate lympho-hematopoiesis) include, without limitation, those described in: Methylcellulose colony forming assays, Freshney, M. G. In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol pp. 265-268, Wiley-Liss, Inc., New York, N.Y. 1994; Hirayama et al., Proc. Natl. Acad. Sci. USA 89:5907-5911, 1992; Primitive hematopoietic colony forming cells with high proliferative potential, McNiece, I. K. and Briddell, R. A. In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol pp. 23-39, Wiley-Liss, Inc., New York, N.Y. 1994; Neben et al., Experimental Hematology 22:353-359, 1994; Cobblestone area forming cell assay, Ploemacher, R. E. In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol pp. 1-21, Wiley-Liss, Inc., New York, N.Y. 1994; Long term bone marrow cultures in the presence of stromal cells, Spooncer, E., Dexter, M. and Allen, T. In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol pp. 163-179, Wiley-Liss, Inc., New York, N.Y. 1994; Long term culture initiating cell assay, Sutherland, H. J. In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol pp. 139-162, Wiley-Liss, Inc., New York, N.Y. 1994.

Tissue Growth Activity

A protein of the present invention also may have utility in compositions used for bone, cartilage, tendon, ligament and/or nerve tissue growth or regeneration, as well as for wound healing and tissue repair and replacement, and in the treatment of burns, incisions and ulcers.

A protein of the present invention, which induces cartilage and/or bone growth in circumstances where bone is not normally formed, has application in the healing of bone fractures and cartilage damage or defects in humans and other animals. Such a preparation employing a protein of the invention may have prophylactic use in closed as well as open fracture reduction and also in the improved fixation of artificial joints. De novo bone formation induced by an osteogenic agent contributes to the repair of congenital, trauma induced, or oncologic resection induced craniofacial defects, and also is useful in cosmetic plastic surgery.

A protein of this invention may also be used in the treatment of periodontal disease, and in other tooth repair processes. Such agents may provide an environment to attract bone-forming cells, stimulate growth of bone-forming cells or induce differentiation of progenitors of bone-forming cells. A protein of the invention may also be useful in the treatment of osteoporosis or osteoarthritis, such as through stimulation of bone and/or cartilage repair or by blocking inflammation or processes of tissue destruction (collagenase activity, osteoclast activity, etc.) mediated by inflammatory processes.

Another category of tissue regeneration activity that may be attributable to the protein of the present invention is tendon/ligament formation. A protein of the present invention, which induces tendon/ligament-like tissue or other tissue formation in circumstances where such tissue is not normally formed, has application in the healing of tendon or ligament tears, deformities and other tendon or ligament defects in humans and other animals. Such a preparation employing a tendon/ligament-like tissue inducing protein may have prophylactic use in preventing damage to tendon or ligament tissue, as well as use in the improved fixation of tendon or ligament to bone or other tissues, and in repairing defects to tendon or ligament tissue. De novo tendon/ligament-like tissue formation induced by a composition of the present invention contributes to the repair of congenital, trauma induced, or other tendon or ligament defects of other origin, and is also useful in cosmetic plastic surgery for attachment or repair of tendons or ligaments. The compositions of the present invention may provide an environment to attract tendon- or ligament-forming cells, stimulate growth of tendon- or ligament-forming cells, induce differentiation of progenitors of tendon- or ligament-forming cells, or induce growth of tendon/ligament cells or progenitors ex vivo for return in vivo to effect tissue repair. The compositions of the invention may also be useful in the treatment of tendinitis, carpal tunnel syndrome and other tendon or ligament defects. The compositions may also include an appropriate matrix and/or sequestering agent as a carrier as is well known in the art.

The protein of the present invention may also be useful for proliferation of neural cells and for regeneration of nerve and brain tissue, i.e. for the treatment of central and peripheral nervous system diseases and neuropathies, as well as mechanical and traumatic disorders, which involve degeneration, death or trauma to neural cells or nerve tissue. More specifically, a protein may be used in the treatment of diseases of the peripheral nervous system, such as peripheral nerve injuries, peripheral neuropathy and localized neuropathies, and central nervous system diseases, such as Alzheimer's, Parkinson's disease, Huntington's disease, amyotrophic lateral sclerosis, and Shy-Drager syndrome. Further conditions which may be treated in accordance with the present invention include mechanical and traumatic disorders, such as spinal cord disorders, head trauma and cerebrovascular diseases such as stroke. Peripheral neuropathies resulting from chemotherapy or other medical therapies may also be treatable using a protein of the invention.

Proteins of the invention may also be useful to promote better or faster closure of non-healing wounds, including without limitation pressure ulcers, ulcers associated with vascular insufficiency, surgical and traumatic wounds, and the like.

It is expected that a protein of the present invention may also exhibit activity for generation or regeneration of other tissues, such as organs (including, for example, pancreas, liver, intestine, kidney, skin, endothelium), muscle (smooth, skeletal or cardiac) and vascular (including vascular endothelium) tissue, or for promoting the growth of cells comprising such tissues. Part of the desired effects may be by inhibition or modulation of fibrotic scarring to allow normal tissue to regenerate. A protein of the invention may also exhibit angiogenic activity.

A protein of the present invention may also be useful for gut protection or regeneration and treatment of lung or liver fibrosis, reperfusion injury in various tissues, and conditions resulting from systemic cytokine damage.
A protein of the present invention may also be useful for promoting or inhibiting differentiation of tissues described above from precursor tissues or cells; or for inhibiting the growth of tissues described above.

The activity of a protein of the invention may, among other means, be measured by the following methods:

Assays for tissue generation activity include, without limitation, those described in: International Patent Publication No. WO95/16035 (bone, cartilage, tendon); International Patent Publication No. WO95/05846 (nerve, neuronal); International Patent Publication No. WO91/07491 (skin, endothelium).

Assays for wound healing activity include, without limitation, those described in: Winter, Epidermal Wound Healing, pps. 71-112 (Maibach, H. I. and Rovee, D. T., eds.), Year Book Medical Publishers, Inc., Chicago, as modified by Eaglstein and Mertz, J. Invest. Dermatol 71:382-84 (1978).

Activin/Inhibin Activity

A protein of the present invention may also exhibit activin- or inhibin-related activities. Inhibins are characterized by their ability to inhibit the release of follicle stimulating hormone (FSH), while activins are characterized by their ability to stimulate the release of follicle stimulating hormone (FSH). Thus, a protein of the present invention, alone or in heterodimers with a member of the inhibin α family, may be useful as a contraceptive based on the ability of inhibins to decrease fertility in female mammals and decrease spermatogenesis in male mammals. Administration of sufficient amounts of other inhibins can induce infertility in these mammals. Alternatively, the protein of the invention, as a homodimer or as a heterodimer with other protein subunits of the inhibin-β group, may be useful as a fertility inducing therapeutic, based upon the ability of activin molecules to stimulate FSH release from cells of the anterior pituitary. See, for example, U.S. Pat. No. 4,798,885. A protein of the invention may also be useful for advancement of the onset of fertility in sexually immature mammals, so as to increase the lifetime reproductive performance of domestic animals such as cows, sheep and pigs.

The activity of a protein of the invention may, among other means, be measured by the following methods:

Assays for activin/inhibin activity include, without limitation, those described in: Vale et al., Endocrinology 91:562-572, 1972; Ling et al., Nature 321:779-782, 1986; Vale et al., Nature 321:776-779, 1986; Mason et al., Nature 318:659-663, 1985; Forage et al., Proc. Natl. Acad. Sci. USA 83:3091-3095, 1986.

Chemotactic/Chemokinetic Activity

A protein of the present invention may have chemotactic or chemokinetic activity (e.g., act as a chemokine) for mammalian cells, including, for example, monocytes, fibroblasts, neutrophils, T-cells, mast cells, eosinophils, epithelial and/or endothelial cells. Chemotactic and chemokinetic proteins can be used to mobilize or attract a desired cell population to a desired site of action. Chemotactic or chemokinetic proteins provide particular advantages in treatment of wounds and other trauma to tissues, as well as in treatment of localized infections. For example, attraction of lymphocytes, monocytes or neutrophils to tumors or sites of infection may result in improved immune responses against the tumor or infecting agent.

A protein or peptide has chemotactic activity for a particular cell population if it can stimulate, directly or indirectly, the directed orientation or movement of such cell population. Preferably, the protein or peptide has the ability to directly stimulate directed movement of cells. Whether a particular protein has chemotactic activity for a population of cells can be readily determined by employing such protein or peptide in any known assay for cell chemotaxis.

The activity of a protein of the invention may, among other means, be measured by the following methods:

Assays for chemotactic activity (which will identify proteins that induce or prevent chemotaxis) consist of assays that measure the ability of a protein to induce the migration of cells across a membrane as well as the ability of a protein to induce the adhesion of one cell population to another cell population. Suitable assays for movement and adhesion include, without limitation, those described in: Current Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 6.12, Measurement of alpha and beta Chemokines 6.12.1-6.12.28); Taub et al. J. Clin. Invest. 95:1370-1376, 1995; Lind et al. APMIS 103:140-146, 1995; Muller et al Eur. J. Immunol. 25:1744-1748; Gruber et al. J. of Immunol. 152:5860-5867, 1994; Johnston et al. J. of Immunol. 153:1762-1768, 1994.

Hemostatic and Thrombolytic Activity

A protein of the invention may also exhibit hemostatic or thrombolytic activity. As a result, such a protein is expected to be useful in treatment of various coagulation disorders (including hereditary disorders, such as hemophilias) or to enhance coagulation and other hemostatic events in treating wounds resulting from trauma, surgery or other causes. A protein of the invention may also be useful for dissolving or inhibiting formation of thromboses and for treatment and prevention of conditions resulting therefrom (such as, for example, infarction of cardiac and central nervous system vessels (e.g., stroke)).

The activity of a protein of the invention may, among other means, be measured by the following methods:

Assays for hemostatic and thrombolytic activity include, without limitation, those described in: Linet et al., J. Clin. Pharmacol. 26:131-140, 1986; Burdick et al., Thrombosis Res. 45:413-419, 1987; Humphrey et al., Fibrinolysis 5:71-79 (1991); Schaub, Prostaglandins 35:467-474, 1988.

Receptor/Ligand Activity

A protein of the present invention may also demonstrate activity as receptors, receptor ligands or inhibitors or agonists of receptor/ligand interactions. Examples of such receptors and ligands include, without limitation, cytokine receptors and their ligands, receptor kinases and their ligands, receptor phosphatases and their ligands, receptors involved in cell-cell interactions and their ligands (including without limitation, cellular adhesion molecules (such as selectins, integrins and their ligands) and receptor/ligand pairs involved in antigen presentation, antigen recognition and development of cellular and humoral immune responses). Receptors and ligands are also useful for screening of potential peptide or small molecule inhibitors of the relevant receptor/ligand interaction. A protein of the present invention (including, without limitation, fragments of receptors and ligands) may itself be useful as an inhibitor of receptor/ligand interactions.

The activity of a protein of the invention may, among other means, be measured by the following methods:

Suitable assays for receptor-ligand activity include without limitation those described in: Current Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 7.28, Measurement of Cellular Adhesion under static conditions 7.28.1-7.28.22), Takai et al., Proc. Natl. Acad. Sci. USA
84:6864-6868, 1987; Bierer et al., J. Exp. Med. 168:1143-1156, 1988; Rosenstein et al., J. Exp. Med. 169:149-160 1989; Stoltenborg et al., J. Immunol. Methods 175:59-68, 1994; Stitt et al., Cell 80:661-670, 1995.

Anti-Inflammatory Activity

Proteins of the present invention may also exhibit anti-inflammatory activity. The anti-inflammatory activity may be achieved by providing a stimulus to cells involved in the inflammatory response, by inhibiting or promoting cell-cell interactions (such as, for example, cell adhesion), by inhibiting or promoting chemotaxis of cells involved in the inflammatory process, inhibiting or promoting cell extravasation, or by stimulating or suppressing production of other factors which more directly inhibit or promote an inflammatory response. Proteins exhibiting such activities can be used to treat inflammatory conditions (including chronic or acute conditions), including without limitation inflammation associated with infection (such as septic shock, sepsis or systemic inflammatory response syndrome (SIRS)), ischemia-reperfusion injury, endotoxin lethality, arthritis, complement-mediated hyperacute rejection, nephritis, cytokine or chemokine-induced lung injury, inflammatory bowel disease, Crohn's disease or resulting from over production of cytokines such as TNF or IL-1. Proteins of the invention may also be useful to treat anaphylaxis and hypersensitivity to an antigenic substance or material.

Tumor Inhibition Activity

In addition to the activities described above for immunological treatment or prevention of tumors, a protein of the invention may exhibit other anti-tumor activities. A protein may inhibit tumor growth directly or indirectly (such as, for example, via ADCC). A protein may exhibit its tumor inhibitory activity by acting on tumor tissue or tumor precursor tissue, by inhibiting formation of tissues necessary to support tumor growth (such as, for example, by inhibiting angiogenesis), by causing production of other factors, agents or cell types which inhibit tumor growth, or by suppressing, eliminating or inhibiting factors, agents or cell types which promote tumor growth.

Other Activities

A protein of the invention may also exhibit one or more of the following additional activities or effects: inhibiting the growth, infection or function of, or killing, infectious agents, including, without limitation, bacteria, viruses, fungi and other parasites; effecting (suppressing or enhancing) bodily characteristics, including, without limitation, height, weight, hair color, eye color, skin, fat to lean ratio or other tissue pigmentation, or organ or body part size or shape (such as, for example, breast augmentation or diminution, change in bone form or shape); effecting biorhythms or circadian cycles or rhythms; effecting the fertility of male or female subjects; effecting the metabolism, catabolism, anabolism, processing, utilization, storage or elimination of dietary fat, lipid, protein, carbohydrate, vitamins, minerals, cofactors or other nutritional factors or component(s); effecting behavioral characteristics, including, without limitation, appetite, libido, stress, cognition (including cognitive disorders), depression (including depressive disorders) and violent behaviors; providing analgesic effects or other pain reducing effects; promoting differentiation and growth of embryonic stem cells in lineages other than hematopoietic lineages; hormonal or endocrine activity; in the case of enzymes, correcting deficiencies of the enzyme and treating deficiency-related diseases; treatment of hyperproliferative disorders (such as, for example, psoriasis); immunoglobulin-like activity (such as, for example, the ability to bind antigens or complement); and the ability to act as an antigen in a vaccine composition to raise an immune response against such protein or another material or entity which is cross-reactive with such protein.

ADMINISTRATION AND DOSING

A protein of the present invention (from whatever source derived, including without limitation from recombinant and non-recombinant sources) may be used in a pharmaceutical composition when combined with a pharmaceutically acceptable carrier. Such a composition may also contain (in addition to protein and a carrier) diluents, fillers, salts, buffers, stabilizers, solubilizers, and other materials well known in the art. The term "pharmaceutically acceptable" means a non-toxic material that does not interfere with the effectiveness of the biological activity of the active ingredient(s). The characteristics of the carrier will depend on the route of administration. The pharmaceutical composition of the invention may contain cytokines, lymphokines, or other hematopoietic factors such as M-CSF, GM-CSF, TNF, IL-1, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-11, IL-12, IL-13, IL-14, IL-15, IFN, TNF0, TNF2, thrombopoietin, stem cell factor, and erythropoietin. The pharmaceutical composition may further contain other agents which either enhance the activity of the protein or complement its activity or use in treatment. Such additional factors and/or agents may be included in the pharmaceutical composition to produce a synergistic effect with protein of the invention, or to minimize side effects. Conversely, protein of the present invention may be included in formulations of the particular cytokine, lymphokine, other hematopoietic factor, thrombolytic or anti-thrombotic factor, or anti-inflammatory agent to minimize side effects of the cytokine, lymphokine, other hematopoietic factor, thrombolytic or anti-thrombotic factor, or anti-inflammatory agent.

A protein of the present invention may be active in multimers (e.g., heterodimers or homodimers) or complexes with itself or other proteins. As a result, pharmaceutical compositions of the invention may comprise a protein of the invention in such multimeric or complexed form.

The pharmaceutical composition of the invention may be in the form of a complex of the protein(s) of present invention along with protein or peptide antigens. The protein and/or peptide antigen will deliver a stimulatory signal to both B and T lymphocytes. B lymphocytes will respond to antigen through their surface immunoglobulin receptor. T lymphocytes will respond to antigen through the T cell receptor (TCR) following presentation of the antigen by MHC proteins. MHC and structurally related proteins including those encoded by class I and class II MHC genes on host cells will serve to present the peptide antigen(s) to T lymphocytes. The antigen components could also be supplied as purified MHC-peptide complexes alone or with co-stimulatory molecules that can directly signal T cells. Alternatively antibodies able to bind surface immunoglobulin and other molecules on B cells as well as antibodies able to bind the TCR and other molecules on T cells can be combined with the pharmaceutical composition of the invention.

The pharmaceutical composition of the invention may be in the form of a liposome in which protein of the present invention is combined, in addition to other pharmaceutically acceptable carriers, with amphipathic agents such as lipids which exist in aggregated form as micelles, insoluble monolayers, liquid crystals, or lamellar layers in aqueous solution. Suitable lipids for liposomal formulation include,
without limitation, monoglycerides, diglycerides, sulfatides, lysolecithin, phospholipids, saponin, bile acids, and the like. Preparation of such liposomal formulations is within the level of skill in the art, as disclosed, for example, in U.S. Pat. Nos. 4,235,871; 4,501,728; 4,837,028; and 4,737,323, all of which are incorporated herein by reference.

As used herein, the term "therapeutically effective amount" means the total amount of each active component of the pharmaceutical composition or method that is sufficient to show a meaningful patient benefit, i.e., treatment, healing, prevention or amelioration of the relevant medical condition, or an increase in rate of treatment, healing, prevention or amelioration of such conditions. When applied to an individual active ingredient administered alone, the term refers to that ingredient alone. When applied to a combination, the term refers to combined amounts of the active ingredients that result in the therapeutic effect, whether administered in combination, serially or simultaneously.

In practicing the method of treatment or use of the present invention, a therapeutically effective amount of protein of the present invention is administered to a mammal having a condition to be treated. Protein of the present invention may be administered in accordance with the method of the invention either alone or in combination with other therapies such as treatments employing cytokines, lymphokines or other hematopoietic factors. When co-administered with one or more cytokines, lymphokines or other hematopoietic factors, protein of the present invention may be administered either simultaneously with the cytokine(s), lymphokine(s), other hematopoietic factor(s), thrombolytic or anti-thrombotic factors, or sequentially. If administered sequentially, the attending physician will decide on the appropriate sequence of administering protein of the present invention in combination with cytokine(s), lymphokine(s), other hematopoietic factor(s), thrombolytic or anti-thrombotic factors.

Administration of protein of the present invention used in the pharmaceutical composition or to practice the method of the present invention can be carried out in a variety of conventional ways, such as oral ingestion, inhalation, topical application or cutaneous, subcutaneous, intraperitoneal, parenteral or intravenous injection. Intravenous administration to the patient is preferred.

When a therapeutically effective amount of protein of the present invention is administered orally, protein of the present invention will be in the form of a tablet, capsule, powder, solution or elixir. When administered in tablet form, the pharmaceutical composition of the invention may additionally contain a solid carrier such as a gelatin or an adjuvant. The tablet, capsule, and powder contain from about 5 to 95% protein of the present invention, and preferably from about 25 to 90% protein of the present invention. When administered in liquid form, a liquid carrier such as water, petroleum, oils of animal or plant origin such as peanut oil, mineral oil, soybean oil, or sesame oil, or synthetic oils may be added. The liquid form of the pharmaceutical composition may further contain physiological saline solution, dextrose or other saccharide solution, or glycols such as ethylene glycol, propylene glycol or polyethylene glycol. When administered in liquid form, the pharmaceutical composition contains from about 0.5 to 90% by weight of protein of the present invention, and preferably from about 1 to 50% protein of the present invention.

When a therapeutically effective amount of protein of the present invention is administered by intravenous, cutaneous or subcutaneous injection, protein of the present invention will be in the form of a pyrogen-free, parenterally acceptable aqueous solution. The preparation of such parenterally acceptable protein solutions, having due regard to pH, isotonicity, stability, and the like, is within the skill in the art. A preferred pharmaceutical composition for intravenous, cutaneous, or subcutaneous injection should contain, in addition to protein of the present invention, an isotonic vehicle such as Sodium Chloride Injection, Ringer's Injection, Dextrose Injection, Dextrose and Sodium Chloride Injection, Lactated Ringer's Injection, or other vehicle as known in the art. The pharmaceutical composition of the present invention may also contain stabilizers, preservatives, buffers, antioxidants, or other additives known to those of skill in the art.

The amount of protein of the present invention in the pharmaceutical composition of the present invention will depend upon the nature and severity of the condition being treated, and on the nature of prior treatments which the patient has undergone. Ultimately, the attending physician will decide the amount of protein of the present invention with which to treat each individual patient. Initially, the attending physician will administer low doses of protein of the present invention and observe the patient's response. Larger doses of protein of the present invention may be administered until the optimal therapeutic effect is obtained for the patient, and at that point the dosage is not increased further. It is contemplated that the various pharmaceutical compositions used to practice the method of the present invention should contain about 0.01 μg to about 100 mg (preferably about 0.1 μg to about 10 mg, more preferably about 0.1 μg to about 1 mg) of protein of the present invention per kg body weight.

The duration of intravenous therapy using the pharmaceutical composition of the present invention will vary, depending on the severity of the disease being treated and the condition and potential idiosyncratic response of each individual patient. It is contemplated that the duration of each application of the protein of the present invention will be in the range of 12 to 24 hours of continuous intravenous administration. Ultimately the attending physician will decide on the appropriate duration of intravenous therapy using the pharmaceutical composition of the present invention.

Protein of the invention may also be used to immunize animals to obtain polyclonal and monoclonal antibodies which specifically react with the protein. Such antibodies may be obtained using either the entire protein or fragments thereof as an immunogen. The peptide immunogens additionally may contain a cysteine residue at the carboxyl terminus, and are conjugated to a hapten such as keyhole limpet hemocyanin (KLH). Methods for synthesizing such peptides are known in the art, for example, as in R. P. Merrifield, J. Amer. Chem. Soc. 85, 2149-2154 (1963); J. L. Krstenansky, et al., FEBS Lett. 211, 10 (1987). Monoclonal antibodies binding to the protein of the invention may be useful diagnostic agents for the immunodetection of the protein. Neutralizing monoclonal antibodies binding to the protein may also be useful therapeutics for both conditions associated with the protein and also in the treatment of some forms of cancer where abnormal expression of the protein is involved. In the case of cancerous cells or leukemic cells, neutralizing monoclonal antibodies against the protein may be useful in detecting and preventing the metastatic spread of the cancerous cells, which may be mediated by the protein.

For compositions of the present invention which are useful for bone, cartilage, tendon or ligament regeneration,
the therapeutic method includes administering the composition topically, systemically, or locally as an implant or device. When administered, the therapeutic composition for use in this invention is, of course, in a pyrogen-free, physiologically acceptable form. Further, the composition may desirably be encapsulated or injected in a viscous form for delivery to the site of bone, cartilage or tissue damage. Topical administration may be suitable for wound healing and tissue repair. Therapeutically useful agents other than a protein of the invention which may also optionally be included in the composition as described above, may alternatively or additionally, be administered simultaneously or sequentially with the composition in the methods of the invention. Preferably for bone and/or cartilage formation, the composition would include a matrix capable of delivering the protein-containing composition to the site of bone and/or cartilage damage, providing a structure for the developing bone and cartilage and optimally capable of being resorbed into the body. Such matrices may be formed of materials presently in use for other implanted medical applications.

The choice of matrix material is based on biocompatibility, biodegradability, mechanical properties, cosmetic appearance and interface properties. The particular application of the compositions will define the appropriate formulation. Potential matrices for the compositions may be biodegradable and chemically defined calcium sulfate, tricalciumphosphate, hydroxyapatite, polylactic acid, polyglycolic acid and polyanhydrides. Other potential materials are biodegradable and biologically well-defined, such as bone or dermal collagen. Further matrices are comprised of pure proteins or extracellular matrix components. Other potential matrices are nonbiodegradable and chemically defined, such as sintered hydroxyapatite, bioglass, aluminates, or other ceramics. Matrices may be comprised of combinations of any of the above mentioned types of material, such as polylactic acid and hydroxyapatite or collagen and tricalciumphosphate. The bioceramics may be altered in composition, such as in calcium-aluminate-phosphate and processing to alter pore size, particle size, particle shape, and biodegradability.

Presently preferred is [illegible] lactic acid and glycolic acid [illegible] having diameters ranging from 150 [illegible]. In some applications, it will be useful to utilize a sequestering agent, such as carboxymethyl cellulose [illegible], to prevent the protein compositions from disassociating from the matrix.

A preferred family of sequestering agents is

ferred sequestering agents include hyaluronic acid, sodium alginate, poly(ethylene glycol), polyoxyethylene oxide, carboxyvinyl polymer and poly(vinyl alcohol). The amount of sequestering agent useful herein is 0.5-20 wt %, preferably 1-10 wt % based on total formulation weight, which represents the amount necessary to prevent desorbtion of the protein from the polymer matrix and to provide appropriate handling of the composition, yet not so much that the progenitor cells are prevented from infiltrating the matrix, thereby providing the protein the opportunity to assist the osteogenic activity of the progenitor cells.

In further compositions, proteins of the invention may be combined with other agents beneficial to the treatment of the bone and/or cartilage defect, wound, or tissue in question. These agents include various growth factors such as epidermal growth factor (EGF), platelet derived growth factor (PDGF), transforming growth factors (TGF-α and TGF-β), and insulin-like growth factor (IGF).

The therapeutic compositions are also presently valuable for veterinary applications. Particularly domestic animals and thoroughbred horses, in addition to humans, are desired patients for such treatment with proteins of the present invention.

The dosage regimen of a protein-containing pharmaceutical composition to be used in tissue regeneration will be determined by the attending physician considering various factors which modify the action of the proteins, e.g., amount of tissue weight desired to be formed, the site of damage, the [illegible] of a wound, type [illegible] time of administration and other clinical factors. The dosage may vary with the type of matrix used in the reconstitution and with inclusion of other proteins in the pharmaceutical composition. For example, the addition of other known growth factors, such as IGF I (insulin like growth factor I), to the final composition, may also effect the dosage. Progress can be monitored by periodic assessment of tissue/bone growth and/or repair, for example, [illegible] histomorphometric determinations and tetracycline labeling.

Polynucleotides of the present invention can also be used for gene therapy. Such polynucleotides can be introduced either in vivo or ex vivo into cells for expression in a mammalian subject. Polynucleotides of the invention may also be administered by other known methods for introduction of nucleic acid into a cell or organism (including, without limitation, in the form of viral vectors or naked DNA).

Cells may also be cultured ex vivo in the presence of

materials such as alkylcelluloses (including , . A . ■ . . . 

hydroxyalkylceUuloses), including methylcellulose, Foteins of the present invention ui onto to^Manteor to 

ethylceT^lose, hydroxyethylcellulose, F^uceadesn-edef^onoractiv^tosnAcells.Tie^ 

hyioxypropylceUulose, hydVoxvpropyl-inefliylceUulose, cells can then be mtroduced m vxvo for therapeutic purposes, 

and carboxymethylcellulose, the most preferred being cat- Patent and literature references cited herein are incorpo- 

ionic salts of carboxymethylcellulose (CMC). Other pre- rated by reference as if fully set forth. 



SEQUENCE LISTING 



( 1 ) GENERAL INFORMATION: 

( i i i ) NUMBER OF SEQUENCES: 11

( 2 ) INFORMATION FOR SEQ ID NO:1:



( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 432 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

GGTTTGAAAA CTCTGCTTCC TTTGTGAATT TGGTGTTAGG AGTTCTTATT GTTATTCTGC 60

AGCCTTTACT ATTGTCCTTT ATTTACTGAA CACAGTGAAT ACCAAGCACT GTTTATTAGA 120

GGTTAGGAGT AGGGGCAGGT GATTAAAAAA ACAAAAAAGC TAATAATCTC CTCAAGCAAT 180

TTCTGGCCTA ATAGAATTAT AGTAGACAGT GAACTATCTA AACCCAGGGA ATCAGATTGA 240

GGCACCATGT CCATCGCCTT GAGAATTAAT AGGCTGCATT TCTGGGTTCT CCNTTTTTTT 300

TTTTTTTTTG CCCAACTGAG TCTTTCTGTG GACTTACATG GAACTTCTTA TTCTCTTAAA 360

TCATTAAGTT ACTTGACAAT ATTCTTGGAT TTGGAGAAAC TGGATGTAGG GCCGTATGAA 420

AAAATCATTC GA 432

( 2 ) INFORMATION FOR SEQ ID NO:2:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 62 amino acids
( B ) TYPE: amino acid
( C ) STRANDEDNESS:
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: protein

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

Met Ser Ile Ala Leu Arg Ile Asn Arg Leu His Phe Trp Val Leu Xaa
1               5                   10                  15

Phe Phe Phe Phe Phe Ala Gln Leu Ser Leu Ser Val Asp Leu His Gly
            20                  25                  30

Thr Ser Tyr Ser Leu Lys Ser Leu Ser Tyr Leu Thr Ile Phe Leu Asp
        35                  40                  45

Leu Glu Lys Leu Asp Val Gly Pro Tyr Glu Lys Ile Ile Arg
    50                  55                  60

( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 219 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:
ATAGGATACN GTATCTNGCT TTTTTCATTT AAACGTCGNG AGCAATTTTC CCAAGACATA 60
ACAAACTGTC TTNGAAAAAN GGAAAACATT NGGGGCTGTC AGCANAACNG AAAATGTTTT 120
CTGGGTGAGA CACATGTATC TTNGNAATGG GTTGGATTTA GTGTGCTTTA TTTCAATAAA 180
AATTCAGTAT TATAATTTAA AAAAAAAAAA AAAAAAAAA 219

( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 501 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:

TCCACAGGTG TCCANTCCCA GGTCCAACTG CAGATTTCGA ATTCGGCCTT CATGGCCTAG 60

AGCGACGCGG AGAARAGCTC CGGGTGCCGC GGCACTGCAG CGCTGAGATT CCTTTACAAA 120

GAAACTCAGA GGACCGGGAA GAAAGAATTT CACCTTTGCG ACGTGCTAGA AAATAARGTC 180

GTCTGGGAAA AGGACTGGAG ACACAAGCGC ATCSCAASYY SRGTGAAGGA SAAASNGAKG 240

GANBTAKWWM MGWGSWGAAA AATKTYWWKC AAMMWMGGTA TTTTCCCTTG GATATTAACT 300

TGCATATCTG AAGAAATGGC ATTCCGGACA ATTTGCGTGT TGGTTGGAGT ATTTATTTGT 360

TCTATCTGTG TGAAAGGATC TTCCCAGCCC CAAGCAAGAG TTTATTTAAC ATTTGATGAA 420

CTTCGAGAAA CCAAGACCTC TGAATACTTC AGCCTTTCCC ACCATCCTTT AGACTACAGG 480

ATTTTATTAA TGGATGAAGA T 501

( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 62 amino acids
( B ) TYPE: amino acid
( C ) STRANDEDNESS:
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: protein

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5:

Met Ala Phe Arg Thr Ile Cys Val Leu Val Gly Val Phe Ile Cys Ser
1               5                   10                  15

Ile Cys Val Lys Gly Ser Ser Gln Pro Gln Ala Arg Val Tyr Leu Thr
            20                  25                  30

Phe Asp Glu Leu Arg Glu Thr Lys Thr Ser Glu Tyr Phe Ser Leu Ser
        35                  40                  45

His His Pro Leu Asp Tyr Arg Ile Leu Leu Met Asp Glu Asp
    50                  55                  60

( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 302 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CTAGCACTAG ACATGTCATG GTCTTCATGG TGCATATAAA TATATTTAAC TTAACCCAGA 60

TTTTATTTAT ATCTTTATTC ACCTTTTCTT CAAAATCGAT ATGGTGGCTG CAAAACTAGA 120

ATTGTTGCAT CCCTCAATNG AATGAGGGCC ATATCCCTGT GGTATTCCTT TCCTGCTTNG 180

GGGCTTTAGA ATTCTAATTG TCAGTGATTT TGTATATGAA AACAAGTTCC AAATCCACAG 240

CTTTTACGTA GTAAAAGTCA TAAATGCATA TGACAGAATG GCTATCAAAA GAAAAAAAAA 300

AA 302

( 2 ) INFORMATION FOR SEQ ID NO:7:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 443 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GOCOAAROCA OCOOCAOOTC OOOAOCAARA TGOCOCTOCO GCCAGOAGCT GOTT CTOOTO 60 
OCOOCOOGOC COCGARGAK Y A T R- 

RYOYORK KT Y YRY YSKO KKWKSMOOST TCATOTTTCC 120 

TOTTOCAGOT OGGATAAOAC CCCCTCAAOO CCTCATGCCO ATOCAGCAAC AAOOATTTCC 180 

T AT GOT CT C T GTCATOCAQC CTAATATOCA AOOCATTATO OOAATGAATT ACAOCTCTCA 240 

GATOTCCCAA OOACCT ATTG CTATGCAGOC AGGAATACCA AT OGO AC C A A TGCCAGCAGC 300 

OOOAATOCCT TACCTAGGAC AAGCACCCTT CCTGGGC AT G COTCCTCCAG GCCCACAGTA 360 

CACTCCAOAC A T GC AG A AO C AGTTTOCCGA AG AGC AG C AG AA ACO AT TTO A AC AG C AO C A 420 

AAAACTCTTA GAAAAAAAAA A A A A A A A A 448 



( 2 ) INFORMATION FOR SEQ ID NO:8:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 107 amino acids
( B ) TYPE: amino acid
( C ) STRANDEDNESS:
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: protein

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8:

Met Phe Pro Val Ala Gly Gly Ile Arg Pro Pro Gln Gly Leu Met Pro
1               5                   10                  15

Met Gln Gln Gln Gly Phe Pro Met Val Ser Val Met Gln Pro Asn Met
            20                  25                  30

Gln Gly Ile Met Gly Met Asn Tyr Ser Ser Gln Met Ser Gln Gly Pro
        35                  40                  45

Ile Ala Met Gln Ala Gly Ile Pro Met Gly Pro Met Pro Ala Ala Gly
    50                  55                  60

Met Pro Tyr Leu Gly Gln Ala Pro Phe Leu Gly Met Arg Pro Pro Gly
65                  70                  75                  80

Pro Gln Tyr Thr Pro Asp Met Gln Lys Gln Phe Ala Glu Glu Gln Gln
                85                  90                  95

Lys Arg Phe Glu Gln Gln Gln Lys Leu Leu Glu
            100                 105



( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 29 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: other nucleic acid

( A ) DESCRIPTION: /desc = "oligonucleotide"

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

GNGCCTCAAT CTGATTCCCT GGGTTTAGA 29



( 2 ) INFORMATION FOR SEQ ID NO:10:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 29 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY:

( i i ) MOLECULE TYPE: other nucleic acid
( A ) DESCRIPTION: /desc =

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:

GNCCGGAATG CCATTTCTTC AGATATGCA 29



( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 29 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: other nucleic acid

( A ) DESCRIPTION: /desc = "oligonucleotide"

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:

TNCCATTGGT ATTCCTGCCT GCATAGCAA 29



What is claimed is:

1. An isolated polynucleotide selected from the group
consisting of:

(a) a polynucleotide comprising the nucleotide sequence
of SEQ ID NO:1;

(b) a polynucleotide comprising the nucleotide sequence
of SEQ ID NO:1 from nucleotide 247 to nucleotide
432;

(c) a polynucleotide comprising the nucleotide sequence
of SEQ ID NO:1 from nucleotide 328 to nucleotide
432;

(d) a polynucleotide comprising the nucleotide sequence
of the full length protein coding sequence of clone
BD372_5 deposited under accession number ATCC
98146;

(e) a polynucleotide encoding the full length protein
encoded by the cDNA insert of clone BD372_5 deposited
under accession number ATCC 98146;

(f) a polynucleotide comprising the nucleotide sequence
of the mature protein coding sequence of clone
BD372_5 deposited under accession number ATCC
98146;

(g) a polynucleotide encoding the mature protein encoded
by the cDNA insert of clone BD372_5 deposited under
accession number ATCC 98146; and

(h) a polynucleotide encoding a protein comprising the
amino acid sequence of SEQ ID NO:2.

2. The polynucleotide of claim 1 comprising the nucleotide
sequence of SEQ ID NO:1.

3. The polynucleotide of claim 1 comprising the nucleotide
sequence of SEQ ID NO:1 from nucleotide 247 to
nucleotide 432.

4. The polynucleotide of claim 1 comprising the nucleotide
sequence of SEQ ID NO:1 from nucleotide 328 to
nucleotide 432.

5. The polynucleotide of claim 1 comprising the nucleotide
sequence of the full length protein coding sequence of
clone BD372_5 deposited under accession number ATCC
98146.

6. The polynucleotide of claim 1 encoding the full length
protein encoded by the cDNA insert of clone BD372_5
deposited under accession number ATCC 98146.

7. The polynucleotide of claim 1 comprising the nucleotide
sequence of the mature protein coding sequence of
clone BD372_5 deposited under accession number ATCC
98146.

8. The polynucleotide of claim 1 encoding the mature
protein encoded by the cDNA insert of clone BD372_5
deposited under accession number ATCC 98146.

9. The polynucleotide of claim 1 encoding a protein
comprising the amino acid sequence of SEQ ID NO:2.

10. A vector comprising a polynucleotide of claim 1
wherein said polynucleotide is operably linked to an
expression control sequence.

11. A host cell transformed with a vector of claim 2.

12. The host cell of claim 3, wherein said cell is a
mammalian cell.

13. A process for producing a protein, which comprises:

(a) growing a culture of the host cell of claim 3 in a
suitable culture medium; and

(b) purifying the protein from the culture.

14. An isolated gene corresponding to the cDNA sequence
of SEQ ID NO:1.
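
The nucleotide ranges recited in the claims can be checked against the sequence listing. The following is a minimal sketch and not part of the patent text: it assumes the standard genetic code and 1-based inclusive coordinates, the function names are hypothetical, and the test sequence is only nucleotides 247 to 294 of SEQ ID NO:1 as read from the OCR-damaged listing (with the letter O read as G, which is an assumption).

```python
# Illustrative sketch only; helper names are hypothetical, not from the patent.
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_count(start, end):
    """Number of whole codons in a 1-based, inclusive nucleotide range."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("range is not a whole number of codons")
    return length // 3


def translate(dna):
    """Translate DNA to one-letter amino acids; ambiguous codons become 'X'."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = dna[i:i + 3].upper()
        if any(base not in BASES for base in codon):
            protein.append("X")  # e.g. a codon containing N (Xaa in the listing)
            continue
        index = (16 * BASES.index(codon[0])
                 + 4 * BASES.index(codon[1])
                 + BASES.index(codon[2]))
        protein.append(AMINO[index])
    return "".join(protein)


# Nucleotides 247-294 of SEQ ID NO:1 as read from the listing (O taken as G).
CODING_START = "ATGTCCATCGCCTTGAGAATTAATAGGCTGCATTTCTGGGTTCTCCNT"

print(codon_count(247, 432))    # codons in the range of claims 1(b) and 3
print(codon_count(328, 432))    # codons in the range of claims 1(c) and 4
print(translate(CODING_START))  # one-letter form of the first 16 residues
```

The range 247 to 432 spans 186 nucleotides, that is 62 codons, which agrees with the 62-residue length stated for SEQ ID NO:2; the codon containing N translates to X, matching the Xaa at position 16.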
ABSTRACT



The present invention relates to purified DNA sequences
encoding all or a portion of an osteoclast-specific or -related
gene product and a method for identifying such sequences.
The invention also relates to antibodies directed against an
osteoclast-specific or -related gene product. Also claimed
are DNA constructs capable of replicating DNA encoding all
or a portion of an osteoclast-specific or -related gene product,
and DNA constructs capable of directing expression in
a host cell of an osteoclast-specific or -related gene product.



5 Claims, 1 Drawing Sheet 



U.S. Patent          Sep. 3, 1996          5,552,281



   1  AGACACCTCT GCCCTCACCA TGAGCCTCTG GCAGCCCCTG GTCCTGGTGC TCCTGGTGCT
  61  GGGCTGCTGC TTTGCTGCCC CCAGACAGCG CCAGTCCACC CTTGTGCTCT TCCCTGGAGA
 121  CCTGAGAACC AATCTCACCG ACAGGCAGCT GGCAGAGGAA TACCTGTACC GCTATGGTTA
 181  CACTCGGGTG GCAGAGATGC GTGGAGAGTC GAAATCTCTG GGGCCTGCGC TGCTGCTTCT
 241  CCAGAAGCAA CTGTCCCTGC CCGAGACCGG TGAGCTGGAT AGCGCCACGC TGAAGGCCAT
 301  GCGAACCCCA CGGTGCGGGG TCCCAGACCT GGGCAGATTC CAAACCTTTG AGGGCGACCT
 361  CAAGTGGCAC CACCACAACA TCACCTATTG GATCCAAAAC TACTCGGAAG ACTTGCCGCG
 421  GGCGGTGATT GACGACGCCT TTGCCCGCGC CTTCGCACTG TGGAGCGCGG TGACGCCGCT
 481  CACCTTCACT CGCGTGTACA GCCGGGACGC AGACATCGTC ATCCAGTTTG GTGTCGCGGA
 541  GCACGGAGAC GGGTATCCCT TCGACGGGAA GGACGGGCTC CTGGCACACG CCTTTCCTCC
 601  TGGCCCCGGC ATTCAGGGAG ACGCCCATTT CGACGATGAC GAGTTGTGGT CCCTGGGCAA
 661  GGGCGTCGTG GTTCCAACTC GGTTTGGAAA CGCAGATGGC GCGGCCTGCC ACTTCCCCTT
 721  CATCTTCGAG GGCCGCTCCT ACTCTGCCTG CACCACCGAC GGTCGCTCCG ACGGGTTGCC
 781  CTGGTGCAGT ACCACGGCCA ACTACGACAC CGACGACCGG TTTGGCTTCT GCCCCAGCGA
 841  GAGACTCTAC ACCCGGGACG GCAATGCTGA TGGGAAACCC TGCCAGTTTC CATTCATCTT
 901  CCAAGGCCAA TCCTACTCCG CCTGCACCAC GGACGGTCGC TCCGACGGCT ACCGCTGGTG
 961  CGCCACCACC GCCAACTACG ACCGGGACAA GCTCTTCGGC TTCTGCCCGA CCCGAGCTGA
1021  CTCGACGGTG ATGGGGGGCA ACTCGGCGGG GGAGCTGTGC GTCTTCCCCT TCACTTTCCT
1081  GGGTAAGGAG TACTCGACCT GTACCAGCGA GGGCCGCGGA GATGGGCGCC TCTGGTGCGC
1141  TACCACCTCG AACTTTGACA GCGACAAGAA GTGGGGCTTC TGCCCGGACC AAGGATACAG
1201  TTTGTTCCTC GTGGCGGCGC ATGAGTTCGG CCACGCGCTG GGCTTAGATC ATTCCTCAGT
1261  GCCGGAGGCG CTCATGTACC CTATGTACCG CTTCACTGAG GGGCCCCCCT TGCATAAGGA
1321  CGACGTGAAT GGCATCCGGC ACCTCTATGG TCCTCGCCCT GAACCTGAGC CACGGCCTCC
1381  AACCACCACC ACACCGCAGC CCACGGCTCC CCCGACGGTC TGCCCCACCG GACCCCCCAC
1441  TGTCCACCCC TCAGAGCGCC CCACAGCTGG CCCCACAGGT CCCCCCTCAG CTGGCCCCAC
1501  AGGTCCCCCC ACTGCTGGCC CTTCTACGGC CACTACTGTG CCTTTGAGTC CGGTGGACGA
1561  TGCCTGCAAC GTGAACATCT TCGACGCCAT CGCGGAGATT GGGAACCAGC TGTATTTGTT
1621  CAAGGATGGG AAGTACTGGC GATTCTCTGA GGGCAGGGGG AGCCGGCCGC AGGGCCCCTT
1681  CCTTATCGCC GACAAGTGGC CCGCGCTGCC CCGCAAGCTG GACTCGGTCT TTGAGGAGGC
1741  GCTCTCCAAG AAGCTTTTCT TCTTCTCTGG GCGCCAGGTG TGGGTGTACA CAGGCGCGTC
1801  GGTGCTGGGC CCGAGGCGTC TGGACAAGCT GGGCCTGGGA GCCGACGTGG CCCAGGTGAC
1861  CGGGGCCCTC CGGAGTGGCA GGGGGAAGAT GCTGCTGTTC AGCGGGCGGC GCCTCTGGAG
1921  GTTCGACGTG AAGGCGCAGA TGGTGGATCC CCGGAGCGCC AGCGAGGTGG ACCGGATGTT
1981  CCCCGGGGTG CCTTTGGACA CGCACGACGT CTTCCAGTAC CGAGAGAAAG CCTATTTCTG
2041  CCAGGACCGC TTCTACTGGC GCGTGAGTTC CCGGAGTGAG TTGAACCAGG TGGACCAAGT
2101  GGGCTACGTG ACCTATGACA TCCTGCAGTG CCCTGAGGAC TAGGGCTCCC GTCCTGCTTT
2161  GCAGTGCCAT GTAAATCCCC ACTGGGACCA ACCCTGGGGA AGGAGCCAGT TTGCCGGATA
2221  CAAACTGGTA TTCTGTTCTG GAGGAAAGGG AGGAGTGGAG GTGGGCTGGG CCCTCTCTTC
2281  TCACCTTTGT TTTTTGTTGG AGTGTTTCTA ATAAACTTGG ATTCTCTAAC CTTT

HUMAN OSTEOCLAST-SPECIFIC AND
-RELATED GENES

RELATED APPLICATION 

This application is a continuation of application Ser. No.
08/045,270 filed on Apr. 6, 1993, now abandoned.

BACKGROUND OF THE INVENTION 

Excessive bone resorption by osteoclasts contributes to
the pathology of many human diseases including arthritis,
osteoporosis, periodontitis, and hypercalcemia of malignancy.
During resorption, osteoclasts remove both the mineral
and organic components of bone (Blair, H. C., et al., J.
Cell Biol. 102:1164 (1986)). The mineral phase is solubilized
by acidification of the sub-osteoclastic lacuna, thus
allowing dissolution of hydroxyapatite (Vaes, G., Clin.
Orthop. Relat. 231:239 (1988)). However, the mechanism(s)
by which type I collagen, the major structural protein of
bone, is degraded remains controversial. In addition, the
regulation of osteoclastic activity is only partly understood.
The lack of information concerning osteoclast function is
due in part to the fact that these cells are extremely difficult
to isolate as pure populations in large numbers. Furthermore,
there are no osteoclastic cell lines available. An approach to
studying osteoclast function that permits the identification of
heretofore unknown osteoclast-specific or -related genes and
gene products would allow identification of genes and gene
products that are involved in the resorption of bone and in
the regulation of osteoclastic activity. Therefore, identification
of osteoclast-specific or -related genes or gene products
would prove useful in developing therapeutic strategies for
the treatment of disorders involving aberrant bone resorption.

SUMMARY OF THE INVENTION 

The present invention relates to isolated DNA sequences
encoding all or a portion of osteoclast-specific or -related
gene products. The present invention further relates to DNA
constructs capable of replicating DNA encoding osteoclast-specific
or -related gene products. In another embodiment,
the invention relates to a DNA construct capable of directing
expression of all or a portion of the osteoclast-specific or
-related gene product in a host cell.

Also encompassed by the present invention are prokaryotic
or eukaryotic cells transformed or transfected with a
DNA construct encoding all or a portion of an osteoclast-specific
or -related gene product. According to a particular
embodiment, these cells are capable of replicating the DNA
construct comprising the DNA encoding the osteoclast-specific
or -related gene product, and, optionally, are capable
of expressing the osteoclast-specific or -related gene product.
Also claimed are antibodies raised against osteoclast-specific
or -related gene products, or portions of these gene
products.

The present invention further embraces a method of
identifying osteoclast-specific or -related DNA sequences
and DNA sequences identified in this manner. In one
embodiment, cDNA encoding osteoclast is identified as
follows: First, human giant cell tumor of the bone was used
to 1) construct a cDNA library; 2) produce 32P-labelled
cDNA to use as a stromal cell+, osteoclast+ probe, and 3)
produce (by culturing) a stromal cell population lacking
osteoclasts. The presence of osteoclasts in the giant cell
tumor was confirmed by histological staining for the osteoclast
marker, type 5 tartrate-resistant acid phosphatase
(TRAP) and with the use of monoclonal antibody reagents.

The stromal cell population lacking osteoclasts was produced
by dissociating cells of a giant cell tumor, then
growing and passaging the cells in tissue culture until the
cell population was homogeneous and appeared fibroblastic.
The cultured stromal cell population did not contain osteoclasts.
The cultured stromal cells were then used to produce
a stromal cell+, osteoclast- 32P-labelled cDNA probe.

The cDNA library produced from the giant cell tumor of
the bone was then screened in duplicate for hybridization to
the cDNA probes: one screen was performed with the giant
cell tumor cDNA probe (stromal cell+, osteoclast+), while a
duplicate screen was performed using the cultured stromal
cell cDNA probe (stromal cell+, osteoclast-). Hybridization
to a stromal+, osteoclast+ probe, accompanied by failure to
hybridize to a stromal+, osteoclast- probe indicated that a
clone contained nucleic acid sequences specifically
expressed by osteoclasts.

In another embodiment, genomic DNA encoding osteoclast-specific
or -related gene products is identified through
known hybridization techniques or amplification techniques.
In one embodiment, the present invention relates to a
method of identifying DNA encoding an osteoclast-specific
or -related protein, or gene product, by screening a cDNA
library or a genomic DNA library with a DNA probe
comprising one or more sequences selected from the group
consisting of the DNA sequences set out in Table I (SEQ ID
NOs: 1-32). Finally, the present invention relates to an
osteoclast-specific or -related protein encoded by a nucleotide
sequence comprising a DNA sequence selected from
the group consisting of the sequences set out in Table I, or
their complementary strands.

BRIEF DESCRIPTION OF FIG. 1 

FIG. 1 shows the cDNA sequence (SEQ ID NO: 33) of
human gelatinase B, and highlights those portions of the
sequence represented by the osteoclast-specific or -related
cDNA clones of the present invention.

DETAILED DESCRIPTION OF THE 
INVENTION 

As described herein, Applicant has identified osteoclast-specific
or osteoclast-related nucleic acid sequences. These
sequences were identified as follows: Human giant cell
tumor of the bone was used to 1) construct a cDNA library;
2) produce 32P-labelled cDNA to use as a stromal cell+,
osteoclast+ probe, and 3) produce (by culturing) a stromal
cell population lacking osteoclasts. The presence of osteoclasts
in the giant cell tumor was confirmed by histological
staining for the osteoclast marker, type 5 acid phosphatase
(TRAP). In addition, monoclonal antibody reagents were
used to characterize the multinucleated cells in the giant cell
tumor, which cells were found to have a phenotype distinct
from macrophages and consistent with osteoclasts.

The stromal cell population lacking osteoclasts was produced
by dissociating cells of a giant cell tumor, then
growing the cells in tissue culture for at least five passages.
After five passages the cultured cell population was homogeneous
and appeared fibroblastic. The cultured population
contained no multinucleated cells at this point, tested negative
for type 5 acid phosphatase, and tested variably alkaline
phosphatase positive. That is, the cultured stromal cell
population did not contain osteoclasts. The cultured stromal
cells were then used to produce a stromal cell+, osteoclast-
32P-labelled cDNA probe.

The cDNA library produced from the giant cell tumor of
the bone was then screened in duplicate for hybridization to
the cDNA probes: one screen was performed with the giant
cell tumor cDNA probe (stromal cell+, osteoclast+), while a
duplicate screen was performed using the cultured stromal
cell cDNA probe (stromal cell+, osteoclast-). Clones that
hybridized to the giant cell tumor cDNA probe (stromal+,
osteoclast+), but not to the stromal cell cDNA probe
(stromal+, osteoclast-), were assumed to contain nucleic acid
sequences specifically expressed by osteoclasts.
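
The selection rule of the differential screen amounts to a set difference. The sketch below is only an illustration of that rule: the clone identifiers and the function name are hypothetical, and each hybridization result is reduced to a yes/no call, a simplification the text does not itself specify.

```python
def osteoclast_candidates(tumor_probe_hits, stromal_probe_hits):
    """Return clones positive with the giant cell tumor probe
    (stromal+, osteoclast+) but negative with the cultured stromal
    cell probe (stromal+, osteoclast-)."""
    return set(tumor_probe_hits) - set(stromal_probe_hits)


# Hypothetical clone identifiers, for illustration only.
tumor_hits = {"clone_01", "clone_02", "clone_03", "clone_04"}
stromal_hits = {"clone_02", "clone_04", "clone_05"}

print(sorted(osteoclast_candidates(tumor_hits, stromal_hits)))
```

Clones that hybridize to both probes are treated as stromal and dropped; only clones seen with the tumor probe alone are kept as candidates.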

As a result of the differential screen described herein, 
DNA specifically expressed in osteoclast cells characterized 
as described herein was identified. This DNA, and equiva- 
lent DNA sequences, is referred to herein as osteoclast- 
specific or osteoclast-related DNA. Osteoclast-specific or 
-related DNA of the present invention can be obtained from 
sources in which it occurs in nature, can be produced 
recombinantly or synthesized chemically; it can be cDNA, 
genomic DNA, recombinantly-produced DNA or chemically-produced
DNA. An equivalent DNA sequence is one
which hybridizes, under standard hybridization conditions, 
to an osteoclast-specific or -related DNA identified as 
described herein or to a complement thereof. 

Differential screening of a human osteoclastoma cDNA 
library was performed to identify genes specifically 
expressed in osteoclasts. Of 12,000 clones screened, 195 
clones were identified which are either uniquely expressed 
in osteoclasts, or are osteoclast-related. These clones were 
further identified as osteoclast-specific, as evidenced by 
failure to hybridize to mRNA derived from a variety of 
unrelated human cell types, including epithelium, fibroblasts,
lymphocytes, myelomonocytic cells, osteoblasts, and
neuroblastoma cells. Of these, 32 clones contain novel 
cDNA sequences which were not found in the GenBank 
database. 
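
The yield of the screen follows directly from the counts given above. The short sketch below only restates that arithmetic; the helper name is hypothetical.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


screened = 12000      # clones screened
differential = 195    # clones positive with the tumor probe only
novel = 32            # differential clones absent from GenBank

print(f"{percent(differential, screened):.1f}% of screened clones were differential")
print(f"{percent(novel, differential):.1f}% of differential clones were novel")
```

About 1.6% of the screened clones passed the differential criterion, and roughly one in six of those carried a sequence not found in the database.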

A large number of cDNA clones obtained by this procedure
were found to represent 92 kDa type IV collagenase
(gelatinase B; E.C. 3.4.24.35) as well as tartrate resistant
acid phosphatase. In situ hybridization localized mRNA for
gelatinase B to multinucleated giant cells in human osteoclastomas.
Gelatinase B immunoreactivity was demonstrated
in giant cells from 8/8 osteoclastomas, osteoclasts in
normal bone, and in osteoclasts of Paget's disease by use of
a polyclonal antisera raised against a synthetic gelatinase B
peptide. In contrast, no immunoreactivity for 72 kDa type IV
collagenase (gelatinase A; E.C. 3.4.24.24), which is the
product of a separate gene, was detected in osteoclastomas
or normal osteoclasts.

The present invention has utility for the production and
identification of nucleic acid probes useful for identifying
osteoclast-specific or -related DNA. Osteoclast-specific or
-related DNA of the present invention can be used to
produce osteoclast-specific or -related gene products useful
in the therapeutic treatment of disorders involving aberrant
bone resorption. The osteoclast-specific or -related
sequences are also useful for generating peptides which can
then be used to produce antibodies useful for identifying
osteoclast-specific or -related gene products, or for altering
the activity of osteoclast-specific or -related gene products.
Such antibodies are referred to as osteoclast-specific antibodies.
Osteoclast-specific antibodies are also useful for
identifying osteoclasts. Finally, osteoclast-specific or -related
DNA sequences of the present invention are useful in
gene therapy. For example, they can be used to alter the
expression in osteoclasts of an aberrant osteoclast-specific
or -related gene product or to correct aberrant expression of
an osteoclast-specific or -related gene product. The
sequences described herein can further be used to cause
osteoclast-specific or -related gene expression in cells in
which such expression does not ordinarily occur, i.e., in cells
which are not osteoclasts.

Example 1 — Osteoclast cDNA Library Construction

Messenger RNA (mRNA) obtained from a human osteoclastoma
('giant cell tumor of bone'), was used to construct
an osteoclastoma cDNA library. Osteoclastomas are actively
bone resorptive tumors, but are usually non-metastatic. In
cryostat sections, osteoclastomas consist of ~30% multinucleated
cells positive for tartrate resistant acid phosphatase
(TRAP), a widely utilized phenotypic marker specific
in vivo for osteoclasts (Minkin, Calcif. Tissue Int.
34:285-290 (1982)). The remaining cells are uncharacterized
'stromal' cells, a mixture of cell types with fibroblastic/
mesenchymal morphology. Although it has not yet been
definitively shown, it is generally held that the osteoclasts in
these tumors are non-transformed, and are activated to
resorb bone in vivo by substance(s) produced by the stromal
cell element.

Monoclonal antibody reagents were used to partially
characterize the surface phenotype of the multinucleated
cells in the giant cell tumors of long bone. In frozen sections,
all multinucleated cells expressed CD68, which has previously
been reported to define an antigen specific for both
osteoclasts and macrophages (Horton, M. A. and M. H.
Helfrich, In Biology and Physiology of the Osteoclast, B. R.
Rifkin and C. V. Gay, editors, CRC Press, Inc. Boca Raton,
Fla., 33-54 (1992)). In contrast, no staining of giant cells
was observed for CD11b or CD14 surface antigens, which
are present on monocyte/macrophages and granulocytes
(Arnaout, M. A. et al. J. Cell Physiol. 137:305 (1988);
Haziot, A. et al. J. Immunol. 141:547 (1988)). Cytocentrifuge
preparations of human peripheral blood monocytes
were positive for CD68, CD11b, and CD14. These results
demonstrate that the multinucleated giant cells of osteoclastomas
have a phenotype which is distinct from that of
macrophages, and which is consistent with that of osteoclasts.

Osteoclastoma tissue was snap frozen in liquid nitrogen
and used to prepare poly A+ mRNA according to standard
methods. cDNA cloning into a pcDNAII vector was carried
out using a commercially-available kit (Librarian, InVitrogen).
Approximately 2.6x10* clones were obtained, >95%
of which contained inserts of an average length 0.6 kB.

Example 2 — Stromal Cell mRNA Preparation 

A portion of each osteoclastoma was snap frozen in liquid
nitrogen for mRNA preparation. The remainder of the tumor
was dissociated using brief trypsinization and mechanical
disaggregation, and placed into tissue culture. These cells
were expanded in Dulbecco's MEM (high glucose, Sigma)
supplemented with 10% newborn calf serum (MA Bioproducts),
gentamycin (0.5 mg/ml), L-glutamine (2 mM) and
non-essential amino acids (0.1 mM) (Gibco). The stromal
cell population was passaged at least five times, after which
it showed a homogenous, fibroblastic looking cell population
that contained no multinucleated cells. The stromal cells
were mononuclear, tested negative for acid phosphatase, and
tested variably alkaline phosphatase positive. These findings
indicate that propagated stromal cells (i.e. stromal cells that
are passaged in culture) are non-osteoclastic and non-activated.

Example 3 — Identification of DNA Encoding
Osteoclastoma-Specific or -Related Gene Products
by Differential Screening of an Osteoclastoma
cDNA Library

A total of 12,000 clones drawn from the osteoclastoma
cDNA library were screened by differential hybridization,
using mixed 32P-labelled cDNA probes derived from (1)
giant cell tumor mRNA (stromal cell+, OC+), and (2) mRNA
from stromal cells (stromal cell+, OC-) cultivated from the
same tumor. The probes were labelled with [32P]dCTP by
random priming to an activity of ~10^9 CPM/ug. Of these
12,000 clones, 195 gave a positive hybridization signal with
giant cell (i.e., osteoclast and stromal cell) mRNA, but not
with stromal cell mRNA. Additionally, these clones failed to
hybridize to cDNA produced from mRNA derived from a
variety of unrelated human cell types including epithelial
cells, fibroblasts, lymphocytes, myelomonocytic cells,
osteoblasts, and neuroblastoma cells. The failure of these
clones to hybridize to cDNA produced from mRNA derived
from other cell types supports the conclusion that these
clones are either uniquely expressed in osteoclasts, or are
osteoclast-related.
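
The selection rule behind the differential screen can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clone identifiers and the boolean signal values below are hypothetical, and only the totals (195 selected out of 12,000 screened) come from the text.

```python
from typing import Dict, List, Tuple


def select_differential(signals: Dict[str, Tuple[bool, bool]]) -> List[str]:
    """Return clones positive with the tumor probe and negative with the stromal probe.

    Each value is (tumor_probe_signal, stromal_probe_signal).
    """
    return sorted(
        clone for clone, (tumor, stromal) in signals.items()
        if tumor and not stromal
    )


def hit_rate(n_selected: int, n_screened: int) -> float:
    """Percentage of screened clones that passed the differential screen."""
    return 100.0 * n_selected / n_screened


# Hypothetical clone names and signals, for illustration only.
example_signals = {
    "clone_001": (True, False),   # tumor+ stromal-  -> selected
    "clone_002": (True, True),    # hybridizes to both probes -> rejected
    "clone_003": (False, False),  # no signal -> rejected
    "clone_004": (True, False),   # tumor+ stromal-  -> selected
}

selected = select_differential(example_signals)
rate = hit_rate(195, 12_000)  # totals reported in the text

print(selected)
print(f"{rate:.3f}% of screened clones were tumor+ stromal-")
```

With the reported totals, fewer than 2% of the screened clones pass the screen, which is why the later sequencing effort is confined to this subset.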

The osteoclast (OC) cDNA library was screened for
differential hybridization to OC cDNA (stromal cell+, OC+)
and stromal cell cDNA (stromal cell+, OC-) as follows:



NYTRAN filters (Schleicher & Schuell) were placed on
agar plates containing growth medium and ampicillin.
Individual bacterial colonies from the OC library were randomly
picked and transferred, in triplicate, onto filters with
preruled grids and then onto a master agar plate. Up to 200
colonies were inoculated onto a single 90-mm filter/plate
using these techniques. The plates were inverted and
incubated at 37° C. until the bacterial inoculates had grown (on
the filter) to a diameter of 0.5-1.0 mm.

The colonies were then lysed, and the DNA bound to the
filters by first placing the filters on top of two pieces of
Whatman 3 MM paper saturated with 0.5N NaOH for 5
minutes. The filters were neutralized by placing on two
pieces of Whatman 3 MM paper saturated with 1M Tris-HCl,
pH 8.0 for 3-5 minutes. Neutralization was followed
by incubation on another set of Whatman 3 MM papers
saturated with 1M Tris-HCl, pH 8.0/1.5M NaCl for 3-5
minutes. The filters were then washed briefly in 2xSSC.

DNA was immobilized on the filters by baking the filters
at 80° C. for 30 minutes. Filters were best used immediately,
but they could be stored for up to one week in a vacuum jar
at room temperature.

Filters were prehybridized in 5-8 ml of hybridization
solution per filter, for 2-4 hours in a heat sealable bag. An
additional 2 ml of solution was added for each additional
filter added to the hybridization bag. The hybridization
buffer consisted of 5xSSC, 5xDenhardt's solution, 1% SDS
and 100 ug/ml denatured heterologous DNA.

Prior to hybridization, labeled probe was denatured by
heating in 1xSSC for 5 minutes at 100° C., then immediately
chilled on ice. Denatured probe was added to the filters in
hybridization solution, and the filters hybridized with
continuous agitation for 12-20 hours at 65° C.

After hybridization, the filters were washed in 2xSSC/
0.2% SDS at 50°-60° C. for 30 minutes, followed by
washing in 0.2xSSC/0.2% SDS at 60° C. for 60 minutes.

The filters were then air dried and autoradiographed using
an intensifying screen at -70° C. overnight.

Example 4 — DNA Sequencing of Selected Clones

Clones reactive with the mixed tumor probe, but unreactive
with the stromal cell probe, are expected to contain
either osteoclast-related, or in vivo 'activated' stromal-cell-
related gene products. One hundred and forty-four cDNA
clones that hybridized to tumor cell cDNA, but not to
stromal cell cDNA, were sequenced by the dideoxy chain
termination method of Sanger et al. (Sanger F., et al. Proc.
Natl. Acad. Sci. USA 74:5463 (1977)) using Sequenase (US
Biochemical). The DNASIS (Hitachi) program was used to
carry out sequence analysis and a homology search in the
GenBank/EMBL database.

Fourteen of the 195 tumor+ stromal- clones were identified
as containing inserts with a sequence identical to the
osteoclast marker, type 5 tartrate-resistant acid phosphatase
(TRAP) (GenBank accession number J04430 M19534). The
high representation of TRAP positive clones also indicates
the effectiveness of the screening procedure in enriching for
clones which contain osteoclast-specific or -related cDNA
sequences.

Interestingly, an even larger proportion of the tumor+
stromal- clones (77/195; 39.5%) were identified as human
gelatinase B (macrophage-derived gelatinase) (Wilhelm, S.
M. J. Biol. Chem. 264:17213 (1989)), again indicating high
expression of this enzyme by osteoclasts. Twenty-five of the
gelatinase B clones were identified by dideoxy sequence
analysis; all 25 showed 100% sequence homology to the
published gelatinase B sequence (GenBank accession number
J05070). The portions of the gelatinase B cDNA
sequence covered by these clones are shown in the FIGURE
(SEQ ID NO: 33). An additional 52 gelatinase B clones were
identified by reactivity with a 32P-labelled probe for
gelatinase B.

Thirteen of the sequenced clones yielded no readable
sequence. A DNASIS search of GenBank/EMBL databases
revealed that, of the remaining 91 clones, 32 clones contain
novel sequences which have not yet been reported in the
databases or in the literature. These partial sequences are
presented in Table I. Note that three of these sequences were
repeats, indicating fairly frequent representation of mRNA
related to this sequence. The repeat sequences are indicated
by * superscripts (Clones 198B, 223B and 32C of Table I).
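
The clone counts quoted in this example can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text (195 tumor+ stromal- clones, 14 TRAP clones, 25 sequenced plus 52 probe-identified gelatinase B clones); the variable and function names are illustrative, not part of the original disclosure.

```python
TOTAL_DIFFERENTIAL_CLONES = 195   # tumor+ stromal- clones
TRAP_CLONES = 14                  # identical to TRAP
GELATINASE_B_SEQUENCED = 25       # identified by dideoxy sequencing
GELATINASE_B_BY_PROBE = 52        # identified with a 32P-labelled probe


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


gelatinase_b_total = GELATINASE_B_SEQUENCED + GELATINASE_B_BY_PROBE
trap_pct = percent(TRAP_CLONES, TOTAL_DIFFERENTIAL_CLONES)
gelatinase_b_pct = percent(gelatinase_b_total, TOTAL_DIFFERENTIAL_CLONES)

print(f"gelatinase B clones: {gelatinase_b_total} ({gelatinase_b_pct}%)")
print(f"TRAP clones: {TRAP_CLONES} ({trap_pct}%)")
```

The two gelatinase B subtotals add up to the 77 clones reported, and 77/195 reproduces the 39.5% figure given in the text.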



TABLE I

PARTIAL SEQUENCES OF 32 NOVEL OC-SPECIFIC OR -RELATED
EXPRESSED GENES (cDNA CLONES)

34A (SEQ ID NO: 1)
1 GCAAATATCT
61 AATGTTTCTA
121 GTGATATTCT
4B (SEQ ID NO: 2)
1 GTGTCAACCT



AAGTT TATTC 

Gooiu nn 

CTTTGAATAA 
CCATATOCTA 



C TTGGATTTC 
AO II lOll 11 
ACCTATAATA 

AAAATGTCAA 



TAOTCAGAGC 
TATTGAAAAA 
GAAAATAGCA 

AATCCTGCAT 



TGTTGAATTT 

TTTAATTATT 

GCAGACAACA 

CTOGTTAATG 



GGTGATGTCA 
TATGCTATAG 



TCGOGGTAGG



61 GGO 

12B (SEQ ID NO: 3)
1 CTTCCCTCTC
61 CAGGCCCACA
121 CAACCAGCTG
28B (SEQ ID NO: 4)
1 TTTTATTTCT
61 CTGTGTTTTC
121 AAAOCAAACT
37B (SEQ ID NO: 5)
1 GGCTGGACAT
61 TTGCCCTGGC
121 AGCCACTTTG
181 ACAAAAAAAA
55B (SEQ ID NO: 6)
1 TTGACAAAGC
61 AAGAGTAGTG
121 TAATTTGCCT
60B (SEQ ID NO: 7)
1 GAAGAGAGTT
61 GATCCCGAGG
86B (SEQ ID NO: 8)
1 GGATGGAAAC
61 GCAAACCTGA
121 TGGTTGCTGT
87B (SEQ ID NO: 9)
1 TTCTTGATCT
61 TAGGAGCCGT
121 CAATGATAAA
98B (SEQ ID NO: 10)
1 ACCCATTTCT
61 CTCAAAGAAT
121 GAATATGAGG
110B (SEQ ID NO: 11)
1 ACATATATTA
61 TAAAGTCGGA
121 TAACTTTTTT
118B (SEQ ID NO: 12)
1 CCAAATTTCT
61 TTTGACTACT
133B (SEQ ID NO: 13)
1 AACTAACCTC
61 CCTGAGCCAT
121 AAAT
140B (SEQ ID NO: 14)
1 ATTATTATTC
61 AAAACACACA
121 GATAAACCCO
144B (SEQ ID NO: 15)
1 CGTGACACAA
61 AACAGCATGT
198B* (SEQ ID NO: 16)
1 ATAGGTTAGA
61 ATCTGACTTC
121 TCTACTCCAA
181 ATGTGATTTG
241 TTTAAT
212B (SEQ ID NO: 17)
1 GTCCAGTATA
61 CCTCTAGATA
121 AATGGCCTTC
181 TCTGGAGC
223B* (SEQ ID NO: 18)
1 GCACTTGGAA
61 TGTTCAGTTT
121 CCATGACCTT
181 TAAGAGATGT
241B (SEQ ID NO: 19)
1 TGTTAGTTTT
61 CTAGACGTCC
121 GGAAGGGCTC
181 CTATATOAGC
32C* (SEQ ID NO: 20)
1 CCTATTTCTG
61 TCCGTCTACC
121 CGGTGGAAGG



TTCCTTCCCr 

GGGAGTACTG 

GTGGTGAATG 

AAATATATGT 

GTCTTGCTTC. 

CGCGGOATGG 

GGGTGCOCTC 
CATGTCATCT 
TTACGOGACG 
AAAAAAA 

TGTTTATTTC 
GCTATTATAT 
TC 

GTATGT ACAA 
GAATT 

ATGTAGAAGT 
GATTTCAGCA 
TGCACGTATC 

T TAGAA CACT 
GCTTTTGGAA 
ACTTGACAAA 



AACAATTTTT 

AGAGGCAATA 

ACAAGCTCTA 

ACAGCATTCA 
ATGTATCAAC 
TTTTTACATT 

CTGGAATCCA 
CCAGC 

CTCGGACCCC 
GGCCATCCCT 



TTTTTTTATG 

TCCCATTGAA 

GCACGTCCTG 

ACATGCATTC 
TCATCAGCAG 

TTCTCATTCA 
TCACTTCCTA 
TTCATAAATC 
TCTTCCCTTC 



AAGGAAAGCG 
AAACACCCGA 
TACACATTAG 



GGGAGTTGGT 
CCCCATTTGT 
TTTCACTGTG 
GACTACAGCC 

TAGGAAGGCC 
TATAGTTAGT 
TTTGCTAGTA 
ATAGTAAGGC 

ATCCTGACTT 

AGAGCGTGCA 

GGCAGGATTC 



TTCCCAAGCA 
CCAGACTACT 
CTGCCTGGCA 

ATTACATCCC 
TTCATGGTCC 
AAGCAGATTA 

CACGTCCCTC 
ACCTGGAGTG 
ATTTCCCAGA 



CACCAATAAA 
GGGGTATCAT 



CCCCAACAGG 



CCAGAGAAAA 

TAAAATCTTT 

AATAGCTTAT 

ATGAATAGGG 
TGCTTGAGTG 
A 

ACTGTAAAAT 
TATAGCCCAT 
GTGGTCATTA 

TTTGGCCAAA 

TATAGACTAT 

ATAAAATTAA 

TCCTCCCTCC 



TGCCTCACTC 
TATOAGCOCC 



TT AGCTTA GC 
GGGTTTTGTA 
ATAGGAAATT 

GTTTTATTCA 
GAAGCTGGCC 

C GGGA CTAGT 
AGTTCCCTCT 
TATTCATAAG 
TTTGCACTTT 



TTAAGTCGGT 
TTAACAGATG 
CTCCAGCTAA 



GTGCTATTTT 
TTGTGCTTCA 
GCCATCAAGG 
TGCCCCTGAC 

TGICllCTGG 
CACTGGGGAT 
TCTCCATTTC 
TGT 

TGGACAACGC 

CTTGTGATCC 

TGCAGCTGCT 



GAGClGCICA 

GCTGATCTTC 

CGGGACCCCC 

TAGAAAAAGA 

ATQA TGCCAG 

TTCTGCCATT 

ATATCCCCAG 

GGOCCTCCCCC 

CCACTCATCA 



TAGTA TATGG 
GTTGATGCTC 



CAAGCCAGCT 



ACAATTTTAA 

AGTTAGAAGT 

C 

AAAAAAGAAA 
AGGAGCTCAA 



TTTTGGTCAA 
CTTACTAGAC 
AACCCCTCAG 

ATCTACACGT 

CA AAGTG CAA 

CTTGTTT 

CATCACCATA 



ATTTACACCA 
OCAGTGATTA 



CATGCAAAAT 
CATTTCAGTC 
C 

TAAAACAGCC 
GTGGGCAGGG 

TAGCTTTAAG 
TATATCCTCA 
TCTTTGGTAC 
TRAAATAAAG 



AAGCTAGAGG 

TTAACCTTTT 

AAAGACACAT 



TGAAGCAGAT 
AATGATCCTT 
ACTTTCCTGA 
TG 

GAGTOAGGTT 
GGTGAAAGAG 
TAGAAGATGG 



CCTTCAGCCA 
TAAAATAAGC 
TTTGCATTTC 



CTCCATGGCC 
TCTTAAGCCC 

ccc 

ATCCC AGCAT 
CTGAGGTTGT 
TTTCCAGGTC 

GCACACTCTG 
TTCTTCAGCC 
CATTAAAAAA 



TGATTGGGGT 
ATAAATAGTT 



AAATGCAGAG 



AAAAAGGTGG 
GAOAGAAAGA 



AAACTGTTCA 
CAAGTCCTCT 



AGTTCTAAGC 

ATACAGTATT 

AA 

TTGTAGAATC 
ATAACAAGTC 



GCCTCGAOAC 



ACCAOCCAAC 
TAGGCnTCG 



TTACTGGTGA 
CTTACAAATA 



TGGTTTCCTA 
CCGCC 

CACCCTAGAG 
AGGTAGAAAT 
AAGTTACATG 
TATTTATCTC 



AT TGTAA ATA 
ATGTTTTGAT 
TGAGAGCTTA 



GTG CTGAT AC 
CCTACTTTGC 
CAGCTTGTGT 



TATTAGTCCA 

GGAGAAGAGG 

TTTAGATGAT 



OAAGACTGAC 

TTCATCTCCG 

TCTTCCTAAA 



ACCGCCACCA 
CAGGGAGTCT 



TTTCCCTCCT 
CAGTACAATG 



GCCTCAGGTT 
TTOAATCAAA 
TATTTTGAAA 



TTCTATTTAT ' 
CATATCTACT 



GGTACAGAGA 



AAAAGTTACG 
AGAGGGAGGC 



AAATAAAATG 
CCCAAGAAAG 



TTAATCACAT 
AAACTGGACT 



CTACTGTATA 
AAGGTTAGAT 



GTCATTTCTG 



TATCTATAAA 
CTCTAAGATA 



AGCAGTTAAT 
ACAAAGCAAT 



AAACAATACA 



GACTAGGGTA 
GTCTATGTTT 
ATAAAAAGAA 
CTGTCTACAG 



TCTTTTATGT 

TTGCTTTAAA 

GAGGATAGTC 



TGAOATTGTC 
TTCTCTCCAC 
ACTCTTAGGC 



CTTCTTGGAG 

AAGGGCGAAG 

AACCACAGGT 



A AAGT CATCC 
GCTGTGCCTT 
TTTCATT



34C (SEQ ID NO: 21)
1 CGGACOGTAO
61 CCCCCCOCAC
47C (SEQ ID NO: 22)
1 TTAGTTCAGT
61 GTGGCAGCTG
121 GGAGCTGACC
65C (SEQ ID NO: 23)
1 GCTGAATCTT
61 TGCAAGTGTG
121 AACTGCCCGT
79C (SEQ ID NO: 24)
1 GGCACTCCGA
61 AGAAAACTGG
121 CATTGCCAAC
84C (SEQ ID NO: 25)
1 GCCAGGGCGG
61 GACCTGCAGT
121 CGTGCCTGAG
86C (SEQ ID NO: 26)
1 AACTCTTTCA
61 GTTCATATCA
121 TTCAATTATA
87C (SEQ ID NO: 27)
1 GGATAAGAAA
61 CGCAGCAGCC
121 GTCCTGGTTG
88C (SEQ ID NO: 28)
1 CTCACCTTCG
61 TGTTCAAOGG
89C (SEQ ID NO: 29)
1 ATOCCTGGCT
61 TCCCTGAGTT
121 TCGTTTTCTG
101C (SEQ ID NO: 30)
1 GGCTGGGCAT
61 GTGCCAGCCC
121 CCTTAGCTTT
112C (SEQ ID NO: 31)
1 CCAACTOCTA
61 CAATACTCTC
114C (SEQ ID NO: 32)
1 CATGGATGAA



GTOTGTTTAT 
CCATCACCCC 

CAAACCAGGC 

GGGAGGTTTC 

CAGAGTGGA 

TAAGAGAGAT 
AATTACGTGG 
TTAGAGTCCT 

TATGGAATCC 

GGAAACAAAG 

CTGGCCAGCr 

ACCGTCTTTA 
GGGOOCTAGT 
TAGAACTTGT 

CACTCTGGTA 

ATTCATATTG 

AGAATATATC 

GAAGGCCTGA 
CGCACAGGTT 
GCCGGTGGAG 

AOAGTTTGAC 
AGCCGTGAGC 

GTGGATAGTG 
TCGGAGTGTG 
GTOATGTTGT 

COCTCTCCTC 

GGCTCTGAAO 

CCCATAAGGT 

CCGCGATACA 
CTAAAATAAA 

TGTCTCATGG 



TCCTGTACAA 
AGTGCAATGG 

AACCCCCTTT 
CCCAACACCC 



TTTGGTCTTA 
TATGGATGGT 
CTTAATATTG 

AGAAGGGAAA 

GATATATCCT 

TCCOCAAGAT 

TTCCTCTCCT 
CATCTGTGGC 
TCTGGAATTC 

TITTTAGTTT 

AGCrG TCTCA 

CTAATACTTT 

GGCCTAGGGG 

GAGAGGGGCA 

AGOCACAAAA 

CFGGAGOCGG 
GACGACTCCG 

CTTTTGTGTA 

GAAGTACTAC 

GCTAACAATA 

CTCCATCCCC 

CCAAGGGCCG 

TGGAGTATCT 

GACCCACAGA 
CATGAAGCAC 

TGGGAAGGAA 



ATCATTACAA 
CTAGCTGCTG 

GGCACTGCTG 
TCCTCTGCTT 



AAGGCTTCAT 

TGCTTGTTTA 

ATGTCCTAAC 

CAAGCACTGG 
CATGGCTCGA 
GTGACTCCAG 

GCCTCAGAGG 
AGCGAAGGTG 
C 

AACAATATAT 
TTCTTTTTTT 
TTAAAA 

CCGRGGCTGG 

crrccrcTTG 



ATACCTACTG 
GTGGGGAAGT 

GCAAATGCTC 

TTAACTGTCT 

AGAATAC 

ATACATCACC 

TCCGTGCCAC 

CC 

GTGCCATCCC 



CATGGTACAT 



AACCAAGTCT 
GCCTTT 

CCACTGGGGT 
CCCTtrTGTGT 



CATOAAAGTO 
TTAACTAAAG 
ACTGGGTCTG 

ATAATTAAAA 

AATAAGAACA 

CCAGAAA 

TCAGGAAGGA 
AAGGGACTCA 



GTGJ IG1G1C 
AATGGTCATA 



CCTGC GTCT C 
CTTAGGTTGG 



CCGCTATGAC 
TCTGCGGCGA 



OCTCCTTAAG 
GTCCTGCTTG 



AGGTCTAATG 
GGTGGCTGTG 



TOAGAGACCA 
TTC 



GGGGCAGTCA 



CATGGOGGTT 
CGGGGTCTCA 



TACATGCATA 

ATGTACAGCA 

CTTATGC 

ACAGCTGGGG 
ACGCCTGTGG 



CGTCTG GCAG 
CCTTGTCGCC 



TTGGAAATTA 
TACAGTAGTA 



AGTOCTGGGA 
TGAGGATCTG 



TCGGTCAGOG 
T 

GTTATAGGGC 
GCTGTOGTTA 



TTTACAAACG 
AGTATTCCTC 



CACCGCTCCC 



*Repeated 2 times
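
Table I prints each partial sequence in rows of six 10-base blocks, with each row prefixed by the position of its first base (1, 61, 121, ...). The sketch below regenerates that layout from a plain sequence string and flags characters that cannot be nucleotides, which is useful when checking a transcribed copy. The function names and the demonstration sequence are hypothetical; nothing here reproduces a sequence from the table.

```python
from typing import List, Set


def format_sequence(seq: str, block: int = 10, blocks_per_row: int = 6) -> List[str]:
    """Lay out a sequence as rows of space-separated blocks with 1-based start positions."""
    cleaned = "".join(seq.split()).upper()
    row_len = block * blocks_per_row
    rows = []
    for start in range(0, len(cleaned), row_len):
        chunk = cleaned[start:start + row_len]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        rows.append(f"{start + 1} " + " ".join(blocks))
    return rows


def invalid_bases(seq: str) -> Set[str]:
    """Return characters that are not A, C, G or T (whitespace is ignored)."""
    return set("".join(seq.split()).upper()) - set("ACGT")


# Hypothetical 130-base demonstration sequence, not taken from Table I.
demo = "ACGT" * 32 + "AC"
rows = format_sequence(demo)
for row in rows:
    print(row)

# A letter such as "O" is a typical transcription error for "G" or "C".
print(invalid_bases("ACGOT"))
```

A 130-base input yields three rows starting at positions 1, 61 and 121, matching the numbering convention used in the table.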



Sequence analysis of the OC+ stromal cell- cloned DNA
sequences revealed, in addition to the novel sequences, a
number of previously-described genes. The known genes
identified (including type 5 acid phosphatase, gelatinase B,
cystatin C (13 clones), Alu repeat sequences (11 clones),
creatine kinase (6 clones) and others) are summarized in
Table II. In situ hybridization (described below) directly
demonstrated that gelatinase B mRNA is expressed in
multinucleated osteoclasts and not in stromal cells. Although
gelatinase B is a well-characterized protease, its expression
at high levels in osteoclasts has not been previously
described. The expression in osteoclasts of cystatin C, a
cysteine protease inhibitor, is also unexpected. This finding
has not yet been confirmed by in situ hybridization. Taken
together, these results demonstrate that most of these
identified genes are osteoclast-expressed, thereby confirming the
effectiveness of the differential screening strategy for
identifying DNA encoding osteoclast-specific or -related gene
products. Therefore, novel genes identified by this method
have a high probability of being OC-specific or -related.

In addition, a minority of the genes identified by this
screen are probably not expressed by OCs (Table II). For
example, type III collagen (6 clones), collagen type I (1
clone), dermatan sulfate (1 clone), and type VI collagen (1
clone) are more likely to originate from the stromal cells or
from osteoblastic cells which are present in the tumor. These
cDNA sequences survive the differential screening process
either because the cells which produce them in the tumor in
vivo die out during the stromal cell propagation phase, or
because they stop producing their product in vitro. These
clones do not constitute more than 5-10% of all the
sequences selected by differential hybridization.

TABLE II

SEQUENCE ANALYSIS OF CLONES ENCODING KNOWN
SEQUENCES FROM AN OSTEOCLASTOMA cDNA LIBRARY

Clones with Sequence Homology to                  25 total
Collagenase Type IV
Clones with Sequence Homology to                  14 total
Type 5 Tartrate Resistant Acid Phosphatase
Clones with Sequence Homology to                  13 total
Cystatin C
Clones with Sequence Homology to                  11 total
Alu-repeat Sequences
Clones with Sequence Homology to                   6 total
Creatine Kinase
Clones with Sequence Homology to                   6 total
Type III Collagen
Clones with Sequence Homology to                   5 total
MHC Class I γ Invariant Chain
Clones with Sequence Homology to                   3 total
MHC Class II β Chain
One or Two Clone(s) with Sequence Homology        10 total
to Each of the Following:
  α1 collagen type I
  γ interferon inducible protein
  osteopontin
  Human chondroitin/dermatan sulfate
  α globin
  β glucosidase/sphingolipid activator
  Human CAPL protein (Ca binding)
  Human EST 01024
  Type VI collagen
  Human EST 00553



Example 5 — In situ Hybridization of OC-Expressed
Genes

In situ hybridization was performed using probes derived
from novel cloned sequences in order to determine whether
the novel putative OC-specific or -related genes are
differentially expressed in osteoclasts (and not expressed in the
stromal cells) of human giant cell tumors. Initially, in situ
hybridization was performed using antisense (positive) and
sense (negative control) cRNA probes against human type
IV collagenase/gelatinase B labelled with 35S-UTP.

A thin section of human giant cell tumor reacted with the
antisense probe resulted in intense labelling of all OCs, as
indicated by the deposition of silver grains over these cells,
but failed to label the stromal cell elements. In contrast, only
minimal background labelling was observed with the sense
(negative control) probe. This result confirmed that
gelatinase B is expressed in human OCs.

In situ hybridization was then carried out using cRNA
probes derived from 11/32 novel genes, labelled with
digoxigenin UTP according to known methods.

The results of this analysis are summarized in Table III.
Clones 28B, 118B, 140B, 198B, and 212B all gave positive
reactions with OCs in frozen sections of a giant cell tumor,
as did the positive control gelatinase B. These novel clones
therefore are expressed in OCs and fulfill all criteria for
OC-relatedness. 198B is repeated three times, indicating
relatively high expression. Clones 4B, 37B, 88C and 98B
produced positive reactions with the tumor tissue; however
the signal was not well-localized to OCs. These clones are
therefore not likely to be useful and are eliminated from
further consideration. Clones 86B and 87B failed to give a
positive reaction with any cell type, possibly indicating very
low level expression. This group of clones could still be
useful but may be difficult to study further. The results of this
analysis show that 5/11 novel genes are expressed in OCs,
indicating that ~50% of novel sequences are likely to be
OC-related.

To generate probes for the in situ hybridizations, cDNA
derived from novel cloned osteoclast-specific or -related
cDNA was subcloned into a BlueScript II SK(-) vector. The
orientation of cloned inserts was determined by restriction
analysis of subclones. The T7 and T3 promoters in the
BlueScript II vector were used to generate 35S-labelled (35S-
UTP, 850 Ci/mmol, Amersham, Arlington Heights, Ill.), or
UTP digoxygenin labelled cRNA probes.

TABLE III

In situ HYBRIDIZATION USING PROBES
DERIVED FROM NOVEL SEQUENCES

                      Reactivity with:
Clone              Osteoclasts    Stromal Cells

4B                     +               +
28B*                   +               -
37B                    +               +
86B                    -               -
87B                    -               -
88C                    +               +
98B                    +               +
118B*                  +               -
140B*                  +               -
198B*                  +               -
212B*                  +               -
Gelatinase B*          +               -

*OC-expressed, as indicated by reactivity with antisense probe and lack of
reactivity with sense probe on OCs only.
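
The in situ results described in this example can be tallied programmatically. The sketch below encodes each of the 11 novel clones as a pair of booleans (osteoclast signal, stromal signal) as the outcomes are described in the prose; treating "not well-localized to OCs" as signal in both compartments is an assumption made for illustration, and the category labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

# (osteoclast signal, stromal/other signal), following the prose description.
# Assumption: "not well-localized to OCs" is encoded as (True, True).
RESULTS: Dict[str, Tuple[bool, bool]] = {
    "4B": (True, True),
    "28B": (True, False),
    "37B": (True, True),
    "86B": (False, False),
    "87B": (False, False),
    "88C": (True, True),
    "98B": (True, True),
    "118B": (True, False),
    "140B": (True, False),
    "198B": (True, False),
    "212B": (True, False),
}


def classify(osteoclast: bool, stromal: bool) -> str:
    """Assign a clone to one of four illustrative categories."""
    if osteoclast and not stromal:
        return "OC-expressed"
    if osteoclast and stromal:
        return "not localized"
    if stromal:
        return "stromal only"
    return "no signal"


counts = Counter(classify(*signal) for signal in RESULTS.values())
oc_fraction = counts["OC-expressed"] / len(RESULTS)

print(dict(counts))
print(f"OC-expressed fraction: {oc_fraction:.2f}")
```

The tally gives 5 OC-expressed clones out of 11, which is the roughly 50% figure quoted in the text.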

In situ hybridization was carried out on 7 micron cryostat
sections of a human osteoclastoma as described previously
(Chang, L.-C. et al. Cancer Res. 49:6700 (1989)). Briefly,
tissue was fixed in 4% paraformaldehyde and embedded in
OCT (Miles Inc., Kankakee, Ill.). The sections were
rehydrated, postfixed in 4% paraformaldehyde, washed, and
pretreated with 10 mM DTT, 10 mM iodoacetamide, 10 mM
N-ethylmaleimide and 0.1 triethanolamine-HCl.
Prehybridization was done with 50% deionized formamide, 10 mM
Tris-HCl, pH 7.0, 1x Denhardt's, 500 mg/ml tRNA, 80
mg/ml salmon sperm DNA, 0.3M NaCl, mM EDTA, and
100 mM DTT at 45° C. for 2 hours. Fresh hybridization
solution containing 10% dextran sulfate and 1.5 ng/ml
35S-labelled or digoxygenin labelled RNA probe was
applied after heat denaturation. Sections were coverslipped
and then incubated in a moistened chamber at 45°-50° C.
overnight. Hybridized sections were washed four times with
50% formamide, 2x SSC, containing 10 mM DTT and 0.5%
Triton X-100 at 45° C. Sections were treated with RNase A
and RNase T1 to digest single-stranded RNA, washed four
times in 2x SSC/10 mM DTT.

In order to detect 35S-labelling by autoradiography, slides
were dehydrated, dried, and coated with Kodak NTB-2
emulsion. The duplicate slides were split, and each set was
placed in a black box with desiccant, sealed, and incubated
at 4° C. for 2 days. The slides were developed (4 minutes)
and fixed (5 minutes) using Kodak developer D19 and
Kodak fixer. Hematoxylin and eosin were used as
counterstains.

In order to detect digoxygenin-labelled probes, a Nucleic
Acid Detection Kit (Boehringer-Mannheim, Cat. #1175041)
was used. Slides were washed in Buffer 1 consisting of 100
mM Tris/150 mM NaCl, pH 7.5, for 1 minute. 100 ul Buffer
2 was added (made by adding 2 mg/ml blocking reagent as
provided by the manufacturer) in Buffer 1 to each slide. The
slides were placed on a shaker and gently swirled at 20° C.

Antibody solutions were diluted 1:100 with Buffer 2 (as 
provided by the manufacturer). 100 ul of diluted antibody 
solution was applied to the slides and the slides were then 
incubated in a chamber for 1 hour at room temperature. The 
slides were monitored to avoid drying. After incubation with 
antibody solution, slides were washed in Buffer 1 for 10 
minutes, then washed in Buffer 3 containing 2 mM levami- 
sole for 2 minutes. 

After washing, 100 ul color solution was added to the
slides. Color solution consisted of nitroblue tetrazolium salt
(NBT) (1:225 dilution) 4.5 ul, 5-bromo-4-chloro-3-indolyl
phosphate (1:285 dilution) 3.5 ul, levamisole 0.2 mg in
Buffer 3 (as provided by the manufacturer) in a total volume
of 1 ml. Color solution was prepared immediately before
use.
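
The dilution ratios quoted for the color solution can be checked against the stated volumes. In this sketch the total volume is taken as 1 ml (1000 ul) and the reagent volumes as 4.5 ul for NBT and 3.5 ul for 5-bromo-4-chloro-3-indolyl phosphate; the 3.5 ul value is the reading consistent with the stated 1:285 dilution. The function name is illustrative.

```python
def dilution_factor(total_volume_ul: float, reagent_volume_ul: float) -> float:
    """Return N for a 1:N dilution of a reagent in the given total volume."""
    if reagent_volume_ul <= 0:
        raise ValueError("reagent volume must be positive")
    return total_volume_ul / reagent_volume_ul


TOTAL_UL = 1000.0  # total volume of 1 ml

nbt_factor = dilution_factor(TOTAL_UL, 4.5)   # stated in the text as 1:225
bcip_factor = dilution_factor(TOTAL_UL, 3.5)  # stated in the text as 1:285

print(f"NBT  1:{nbt_factor:.0f}")
print(f"BCIP 1:{bcip_factor:.0f}")
```

The computed factors (about 1:222 and 1:286) agree with the rounded ratios given in the protocol.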

After adding the color solution, the slides were placed in 
a dark, humidified chamber at 20° C for 2-5 hours and 
monitored for color development. The color reaction was 
stopped by rinsing slides in TE Buffer. 

The slides were stained for 60 seconds in 0.25% methyl
green, washed with tap water, then mounted with
water-based Permount (Fisher).

Example 6 — Immunohistochemistry

Immunohistochemical staining was performed on frozen
and paraffin embedded tissues as well as on cytospin
preparations (see Table IV). The following antibodies were used:
polyclonal rabbit anti-human gelatinase antibodies; Ab110
for gelatinase B; monoclonal mouse anti-human CD68
antibody (clone KP1) (DAKO, Denmark); Mo1 (anti-CD11b)
and Mo2 (anti-CD14) derived from ATCC cell lines HB
CRL 8026 and TIB 228/HB44. The anti-human gelatinase B
antibody Ab110 was raised against a synthetic peptide with
the amino acid sequence EALMYPMYRFTEGPPLHK
(SEQ ID NO: 34), which is specific for human gelatinase B
(Corcoran, M. L. et al. J. Biol. Chem. 267:515 (1992)).

Detection of the immunohistochemical staining was
achieved by using a goat anti-rabbit glucose oxidase kit
(Vector Laboratories, Burlingame Calif.) according to the
manufacturer's directions. Briefly, the sections were
rehydrated and pretreated with either acetone or 0.1% trypsin.
Normal goat serum was used to block nonspecific binding.
Incubation with the primary antibody for 2 hours or
overnight (Ab110: 1/500 dilution) was followed by either a
glucose oxidase labeled secondary anti-rabbit serum, or, in the
case of the mouse monoclonal antibodies, by reaction with
purified rabbit anti-mouse Ig before incubation with the
secondary antibody.

Paraffin embedded and frozen sections from
osteoclastomas (GCT) were reacted with a rabbit antiserum against
gelatinase B (antibody 110) (Corcoran, M. L. et al. J. Biol.
Chem. 267:515 (1992)), followed by color development with
glucose oxidase linked reagents. The osteoclasts of a giant
cell tumor were uniformly strongly positive for gelatinase B,
whereas the stromal cells were unreactive. Control sections
reacted with rabbit preimmune serum were negative.
Identical findings were obtained for all 8 long bone giant cell
tumors tested (Table IV). The osteoclasts present in three out
of four central giant cell granulomas (GCG) of the mandible
were also positive for gelatinase B expression. These
neoplasms are similar but not identical to the long bone giant
cell tumors, apart from their location in the jaws (Shafer, W.
G. et al., Textbook of Oral Pathology, W. B. Saunders
Company, Philadelphia, pp. 144-149 (1983)). In contrast,
the multinucleated cells from a peripheral giant cell tumor,
which is a generally non-resorptive tumor of oral soft tissue,
were unreactive with antibody (Shafer, W. G. et al.,
Textbook of Oral Pathology, W. B. Saunders Company,
Philadelphia, pp. 144-149 (1983)).

Antibody 110 was also utilized to assess the presence of
gelatinase B in normal bone (n=3) and in Paget's disease, in
which there is elevated bone remodeling and increased
osteoclastic activity. Strong staining for gelatinase B was
observed in osteoclasts both in normal bone (mandible of a
2 year old), and in Paget's disease. Staining was again absent
in controls incubated with preimmune serum. Osteoblasts
did not stain in any of the tissue sections, indicating that
gelatinase B expression is limited to osteoclasts in bone.
Finally, peripheral blood monocytes were also reactive with
antibody 110 (Table IV).

TABLE IV

DISTRIBUTION OF GELATINASE B IN VARIOUS TISSUES

Samples                    Antibodies tested
                           Ab 110 (gelatinase B)

GCT frozen (n = 2)
  giant cells
  stromal cells
GCT paraffin (n = 6)
  giant cells
  stromal cells
central GCG (n = 4)
  giant cells
  stromal cells
peripheral GCT (n = 4)
  giant cells
  stromal cells
Paget's disease (n = 1)
  osteoclasts
  osteoblasts
normal bone (n = 3)
  osteoclasts
  osteoblasts
monocytes (cytospin)


+(%) 



+ 
+ 






Distribution of gelatinase B in multinucleated giant cells, osteoclasts,
osteoblasts and stromal cells in various tissues. In general, paraffin embedded
tissues were used for these experiments; exceptions are indicated.

Equivalents 

Those skilled in the art will recognize, or be able to
ascertain using no more than routine experimentation, many
equivalents to the specific embodiments described herein.
Such equivalents are intended to be encompassed by the
following claims.



SEQUENCE LISTING 



( 1 ) GENERAL INFORMATION:

( i i i ) NUMBER OF SEQUENCES: 34

( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 170 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:
GCAAATATCT AAOTTTATTG CTTOOATTTC TACTGACACC TCTTOAATTT ■ CCTGATCTCA 6 0 

AATOTTTCTA G0GTTTT7TT AGTTTGTTTT TATTGAAAAA TTTAATTATT T ATG C T A TAG I 2 0 

GTGATATTCT C T TTGA AT A A ACCTATAATA GAAAATAGCA GCAGACAACA 170 

( 2 ) INFORMATION FOR SEQ ID NO:2:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 63 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:
CTGTCAACCT GCATATCCTA AAA A TQ TC A A AATCCTOCAT CTGGTTAATG TCGGGGTAOG 60 
C G 0 6 3 

( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 163 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:
CTTCCCTCTC TTGCTTCCCT TTCCCAAGCA GAGGTGCTCA CTCCATGGCC ACCGCCACCA 60 
CACGCCCACA GGGAOTACTG CCAGACTACT GCTGATCTTC TCTTAACGCC CAOOOAOTCT 120 
CAACCAGCTG CTGGTGAATG CTGCCTGGCA CGGGACCCCC CCC 163 

( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 173 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:
TTTTATTTGT AAATATATGT ATTACATCCC TAGAAAAAGA A7CCCAGGAT TTTCCCTCCT 60 
GTGTCTTTTC OTCTTOCTTC TTCATGGTCC A TG A TGCC AG CTGAGGTTOT C A G T A C A ATG I JO 

AAA CC A A A C T GGCGCGATGG AAGCAOATTA TTCTOCCATT TTTCCAGGTC TTT 173 

( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 197 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5:
GGCTCOACAT CGGTGCCCTC CACCTCCCTC A>T ATCC C C A G GCACACTCTG G CCTC AGGTT 6 0 

TTGCCCTCGC CATGTCATCT ACCTGGAGTG OGCCCTCCCC TTCTTCAOCC TTGAATCAAA 120 
AOCCACTTTG TTACGCGAGG ATTTCCCAGA CCACTCATCA CAT TA A A A A A TATTTTOAAA ISO 
A C A A A A A A A A A A A A A A A 197 

( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 132 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:
TTCACAAAGC TGTTTATTTC CACCAATAAA TACTATATGO TCATTGGGGT TTCTATTTAT 6 0 

A A G AG T AGTG GCTATTATAT OOOGTATCAT OTTGATGCTC A T A A AT AGTT CATATCTACT 120 
TAATTTGCCT TC 132

( 2 ) INFORMATION FOR SEQ ID NO:7:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 75 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:
CAAGAGAGTT OTATGTACAA CCCCAACAGG CAAGGCAGCT AAATGCAOAO GGTACAGAGA 60 
GATCCCGAGG GAATT '5 

( 2 ) INFORMATION FOR SEQ ID NO:8:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 151 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8:
GGATGGAAAC ATOTAOAAGT CCAGAGAAAA ACAATTTTAA AAAAAOGTGG AAAAGTTACG 60 
GCAAACCTGA GATTTCAGCA TAAAATCTTT AGTTAGAAGT GAGAGAAAGA AGAGGGAGGC 120 
TGOTTOC T OT TGCACGTATC A AT AGGTT A T C 151 

( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 141 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

TTCTTGATCT TTACAACACT ATGAATAGOG AAAAAAGAAA AAACTGTTCA AAATAAAATC . 60 

TAGGAGCCGT GCTTTTGGAA TGCTTGACTO AGGAGCTCAA CAAGTCCTCT CCCAAGAAAG 120 

CAATGATAAA ACTTGACAAA A 14 1 

( 2 ) INFORMATION FOR SEQ ID NO:10:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 162 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:
ACCCATTTCT AACAATTTTT ACTGTAAAAT TTTTGOTCAA AGTTCTAAGC TTAATCACAT 6 0 

CTCAAAGAAT AGAGGCAATA TATAOCCCAT CTTACTAOAC ATACAOTATT AAACTGGACT 120 
GAATATGAGG ACAAOCTCTA GTGGTCATTA AACCCCTCAG AA * 162 

( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 157 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:
ACATATATTA ACAGCATTCA TTTGGCCAAA ATCTACACGT TTGTAG AATC CTACTGTATA 60 
TAAAGTGGGA ATGTATCAAG TATAGACTAT GAAAGTGCAA ATAACAAOTC AAOOTTAOAT 120 

TAACTTTTTT TTTTTACATT ATAAAATTAA CTTGTTT 157 

3 

( 2 ) INFORMATION FOR SEQ ID NO:12:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 75 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:12:
CCAAATTTCT CTGGAATCCA TCCTCCCTCC CATCACCATA GCCTCGAGAC GTCATTTCTG 60 
TTTCACTACT CCAGC 75 

( 2 ) INFORMATION FOR SEQ ID NO:13:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 124 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:13:
AACTAACCTC CTCOGACCCC TGCCTCACTC ATTTACACCA ACCACCCAAC TATCTATAAA 6 0 

CCTGAGCCAT GGCCATCCCT TATGAGCGGC GCAGTGATTA TAOGCTTTCG CTCTAAOATA 120 





124



( 2 ) INFORMATION FOR SEQ ID NO:14:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 151 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:14:
ATTATTATTC TTTTTTTATC TTACCTT AG C CATCCAAAAT TTACTOCTCA AOCAGTTAAT 
A A A A C A C A C A TCCCATTGAA GGGTTTTGTA CAT TTCACTC CTTACAAATA AC A AAGCAAT 
CATAAACCCG GCACGTCCTG ATAGGAAATT C 



6 0 
1 2 0 
1 5 1 



< 2 ) INFORMATION FOR SEQ ID NO: 1 5: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 105 hue putt 
( B ) TYPE: Pirktc add 
( C ) STRANDED NESS: douUc 
( D ) TOPOLOGY: finrir 

( i i t MOLECULE TYPE: DNA (jeoaaic) 

( i i > SEQUENCE DESCRIPTION: SEQ ID NO:15: 
CGTGACACAA ACATGCATTC OTTTTATTCA TAAAACAGCC TGGTTTCCTA A A A C A AT AC A 
AACAGCATGT TCATCAOCAO OAAGCTGGCC GTGGGCAGGG OOOCC 



6 0 
I 0 s 



( 2 ) INFORMATION FOR SEQ ID NO: 16:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 246 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 16:
ATAGGTTAGA TTCTCATTCA CGGGACTAGT TACCTTTAAG CACCCTACAG GACTAGGGTA 60
ATCTGACTTC TCACTTCCTA AGTTCCCTCT TATATCCTCA AGCTACAAAT GTCTATGTTT 120
TCTACTCCAA TTCATAAATC TATTCATAAG TCTTTGGTAC AAGTTACATG ATAAAAAGAA 180
ATGTGATTTG TCTTCCCTTC TTTGCACTTT TGAAATAAAG TATTTATCTC CTGTCTACAC 240
TTTAAT 246



( 2 ) INFORMATION FOR SEQ ID NO: 17:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 180 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 17:
GTCCAGTATA AAGGAAAGCG TTAAGTCGGT AAGCTAGAGG ATTCTAAATA TCTTTTATGT 60
CCTCTAGATA AAACACCCGA TTAACAGATG TTAACCTTTT ATGTTTTGAT TTGCTTTAAA 120
AATGGCCTTC TACACATTAG CTCCAGCTAA AAAGACACAT TGAGAGCTTA GAGGATAGTC 180

( 2 ) INFORMATION FOR SEQ ID NO: 18:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 212 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 18:
CCACTTCGAA GCCAGTTGGT CTCCTATTTT TGAAGCACAT GTGGTGATAC TGAGATTGTC 60
TCTTCAGTTT CCCCATTTGT TTCTGCTTCA AATCATCCTT CCTACTTTGC TTCTCTCCAC 120
CCATGACCTT TTTCACTGTG GCCATCAAGG ACTTTCCTGA CAGCTTGTGT ACTCTTAGGC 180
TAAGAGATGT GACTACACCC TGCCCCTGAC TG 212

( 2 ) INFORMATION FOR SEQ ID NO: 19:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 203 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 19:
TGTTAGTTTT TAGGAAGCCC TGTCTTCTGG GAGTGAGCTT TATTAGTCCA CTTCTTGGAG 60
CTAGACGTCC TATAGTTAGT CACTGGGGAT GGTGAAAGAG GGAGAAGAGG AAGGGCGAAG 120
GGAAGGGCTC TTTGCTAGTA TCTCCATTTC TAGAAGATGG TTTAGATGAT AACCACAGGT 180
CTATATGAGC ATAGTAAGGC TGT 203

( 2 ) INFORMATION FOR SEQ ID NO: 20:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 177 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 20:
CCTATTTCTG ATCCTGACTT TGGACAAGGC CCTTCAGCCA CAAGACTGAC AAAGTCATCC 60
TCCGTCTACC AGAGCGTGCA CTTGTGATCC TAAAATAAGC TTCATCTCCC GCTGTGCCTT 120
GGGTGGAAGG GGCAGGATTC TGCAGCTCCT TTTGCATTTC TCTTCCTAAA TTTCATT 177

( 2 ) INFORMATION FOR SEQ ID NO: 21:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 106 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 21:
CGGAGCGTAG GTGTGTTTAT TCCTGTACAA ATCATTACAA AACCAAGTCT GGGGCAGTCA 60
CCGCCCCCAC CCATCACCCC AGTGCAATGG CTAGCTGCTG GCCTTT 106

( 2 ) INFORMATION FOR SEQ ID NO: 22:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 139 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 22:
TTAGTTCAGT CAAAGCAGGC AACCCCCTTT GGCACTGCTC CCACTGGGCT CATGGCGGTT 60
GTGCCAGCTG GGGAGGTTTC CCCAACACCC TCCTCTGCTT CCCTGTCTGT CGGGGTCTCA 120
GGAGCTGACC CAGAGTGGA 139

( 2 ) INFORMATION FOR SEQ ID NO: 23:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 177 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 23:
GCTGAATGTT TAAGAGAGAT TTTGGTCTTA AAGGCTTCAT CATGAAAGTG TACATGCATA 60
TGCAAGTGTG AATTACGTGG TATGGATGGT TGCTTGTTTA TTAACTAAAC ATGTACAGCA 120
AACTGCCCGT TTAGAGTCCT CTTAATATTG ATGTCCTAAC ACTGGGTCTG CTTATGC 177

( 2 ) INFORMATION FOR SEQ ID NO: 24:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 167 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 24:
CGCAGTGGGA TATGGAATCC AGAAGGGAAA CAAGCACTGG ATAATTAAAA ACAGCTGGGG 60
AGAAAACTGG GGAAACAAAG GATATATCCT CATGGCTCGA AATAAGAACA ACGCCTGTGG 120
CATTGCCAAC CTGGCCAGCT TCCCCAAGAT GTGACTCCAG CCAGAAA 167

( 2 ) INFORMATION FOR SEQ ID NO: 25:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 151 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 25:
GCCAGGGCGG ACCGTCTTTA TTCCTCTCCT GCCTCAGAGG TCAGGAAGGA GGTCTGGCAG 60
GACCTGCAGT GGGCCCTACT CATCTGTGGC AGCGAAGGTG AAGGGACTCA CCTTGTCGCC 120
CGTGCCTGAG TAGAACTTGT TCTGGAATTC C 151



( 2 ) INFORMATION FOR SEQ ID NO: 26:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 156 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 26:
AACTCTTTCA CACTCTGGTA TTTTTAGTTT AACAATATAT GTGTTCTGTC TTGCAAATTA 60
GTTCATATCA ATTCATATTG AGCTGTCTCA TTCTTTTTTT AATGGTCATA TACAGTAGTA 120
TTCAATTATA AGAATATATC CTAATACTTT TTAAAA 156

( 2 ) INFORMATION FOR SEQ ID NO: 27:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 150 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 27:
GGATAAGAAA GAAGCCCTGA GGGCTAGCGG CCGGGGCTGG CCTGCGTCTC AGTCCTGGGA 60
CCCACCACCC CGCACAGGTT GAGAGGGGCA CTTCCTCTTG CTTAGGTTGG TGAGGATCTG 120
CTCCTGGTTG GCCGGTGGAG AGCCACAAAA 150

( 2 ) INFORMATION FOR SEQ ID NO: 28:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 212 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 28:
GCACTTGGAA GGGAGTTGGT GTGCTATTTT TGAAGCAGAT CTGGTGATAC TGAGATTCTC 60
TGTTCAGTTT CCCCATTTGT TTGTGCTTCA AATGATCCTT CCTACTTTGC TTCTCTCCAC 120
CCATGACCTT TTTCACTGTC GCCATCAAGG ACTTTCCTGA CAGCTTGTGT ACTCTTAGGC 180
TAACAGATCT CACTACAGCC TGCCCCTGAC TC 212

( 2 ) INFORMATION FOR SEQ ID NO: 29:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 157 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 29:
ATCCCTGGCT GTGGATAGTG CTTTTGTGTA GCAAATGCTC CCTCCTTAAG GTTATAGGGC 60
TCCCTGAGTT TGGGAGTGTG GAAGTACTAC TTAACTGTCT GTCCTGCTTG GCTGTCGTTA 120
TCGTTTTCTG CTGATCTTGT GCTAACAATA AGAATAC 157

( 2 ) INFORMATION FOR SEQ ID NO: 30:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 152 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 30:
GGCTGGGCAT CCCTCTCCTC CTCCATCCCC ATACATCACC AGGTCTAATG TTTACAAACG 60
GTGCCAGCCC GGCTCTGAAG CCAAGGGCCG TCCGTGCCAC GGTGGCTGTG AGTATTCCTC 120
CGTTAGCTTT CCCATAAGGT TGGAGTATCT CC 152

( 2 ) INFORMATION FOR SEQ ID NO: 31:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 90 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 31:
CCAACTCCTA CCGCGATACA GACCCACAGA GTGCCATCCC TGAGAGACCA GACCGCTCCC 60
CAATACTCTC CTAAAATAAA CATGAAGCAC 90

( 2 ) INFORMATION FOR SEQ ID NO: 32:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 43 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 32:
CATGGATGAA TCTCTCATGG TCGGAAGGAA CATGGTACAT TTC 43

( 2 ) INFORMATION FOR SEQ ID NO: 33:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 2334 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 33:
AGACACCTCT CCCCTCACCA TGAGCCTCTG GCAGCCCCTC GTCCTGGTGC TCCTGGTGCT 60
GGGCTGCTCC TTTGCTGCCC CCACACAGCG CCAGTCCACC CTTGTGCTCT TCCCTCGACA 120
CCTGAGAACC AATCTCACCG ACAGGCAGCT GGCAGAGGAA TACCTGTACC GCTATGGTTA 180
CACTCCGCTC GCAGAGATCC GTGGAGAGTC GAAATCTCTG GGGCCTGCGC TGCTCCTTCT 240
CCAGAAGCAA CTGTCCCTGC CCGAGACCGG TGAGCTGGAT AGCGCCACGC TGAAGGCCAT 300
GCGAACCCCA CGGTGCGGGG TCCCAGACCT GGGCAGATTC CAAACCTTTG AGGGCGACCT 360
CAAGTGGCAC CACCACAACA TCACCTATTG GATCCAAAAC TACTCGGAAG ACTTGCCGCG 420
GGCGGTGATT GACGACGCCT TTGCCCGCGC CTTCCCACTG TGGAGCGCCG TGACCCCGCT 480
CACCTTCACT CGCGTGTACA GCCGGGACGC AGACATCGTC ATCCACTTTG GTGTCGCGGA 540
GCACGGAGAC GGGTATCCCT TCGACGGGAA GGACGGGCTC CTGGCACACG CCTTTCCTCC 600
TGGCCCCGGC ATTCAGGGAG ACGCCCATTT CGACGATGAC GAGTTGTCGT CCCTGGGCAA 660
GGGCGTCGTG CTTCCAACTC GGTTTCCAAA CGCAGATGGC GCGGCCTGCC ACTTCCCCTT 720
CATCTTCCAG GGCCGCTCCT ACTCTCCCTG CACCACCGAC GGTCGCTCCG ACGGGTTGCC 780
CTGGTGCAGT ACCACGGCCA ACTACGACAC CGACGACCGG TTTCGCTTCT CCCCCAGCGA 840
CACACTCTAC ACCCGGGACG GCAATGCTGA TGGGAAACCC TGCCAGTTTC CATTCATCTT 900
CCAAGGCCAA TCCTACTCCG CCTGCACCAC GGACGGTCGC TCCGACGGCT ACCGCTGGTG 960
CGCCACCACC GCCAACTACG ACCGGGACAA GCTCTTCGGC TTCTGCCCCA CCCGAGCTGA 1020
CTCCACGGTG ATGGGGGGCA ACTCGGCGGG GGAGCTGTGC GTCTTCCCCT TCACTTTCCT 1080
GGGTAAGGAG TACTCGACCT GTACCAGCGA GGGCCGCGGA GATGGGCGCC TCTGGTGCGC 1140
TACCACCTCG AACTTTGACA CCGACAAGAA GTGCGGCTTC TGCCCGGACC AAGGATACAG 1200
TTTGTTCCTC GTGGCGGCGC ATGAGTTCGG CCACGCGCTG GCCTTAGATC ATTCCTCAGT 1260
GCCGGAGGCG CTCATGTACC CTATGTACCG CTTCACTGAG GGCCCCCCCT TGCATAAGGA 1320
CGACGTGAAT GGCATCCGGC ACCTCTATGG TCCTCGCCCT GAACCTGAGC CACGGCCTCC 1380
AACCACCACC ACACCGCAGC CCACGGCTCC CCCGACGGTC TGCCCCACCG GACCCCCCAC 1440
TGTCCACCCC TCAGAGCCCC CCACAGCTCC CCCCACAGGT CCCCCCTCAG CTGGCCCCAC 1500
AGGTCCCCCC ACTGCTGGCC CTTCTACGGC CACTACTGTG CCTTTGAGTC CGGTGGACGA 1560
TGCCTCCAAC GTGAACATCT TCGACGCCAT CGCGGAGATT GGGAACCAGC TGTATTTCTT 1620
CAAGGATGGG AAGTACTGGC GATTCTCTGA GGCCAGGGGG AGCCGGCCGC AGGGCCCCTT 1680
CCTTATCGCC GACAAGTGGC CCGCGCTGCC CCGCAAGCTG CACTCCGTCT TTGAGGAGCC 1740
CCTCTCCAAG AAGCTTTTCT TCTTCTCTGG GCGCCAGGTG TGGGTGTACA CAGGCGCGTC 1800
GGTGCTGGGC CCGAGGCGTC TGGACAAGCT GGGCCTGGGA GCCGACGTGG CCCAGGTGAC 1860
CGGGGCCCTC CGGAGTGGCA GGGGGAAGAT GCTGCTGTTC AGCGGGCGGC GCCTCTGGAG 1920
GTTCGACCTG AAGGCGCAGA TGGTGGATCC CCGGAGCGCC AGCGAGGTGG ACCGGATGTT 1980
CCCCGGGGTC CCTTTCGACA CGCACGACGT CTTCCAGTAC CGAGAGAAAG CCTATTTCTG 2040
CCAGGACCGC TTCTACTGCC GCGTGAGTTC CCGGAGTGAG TTGAACCACG TGGACCAAGT 2100
GGGCTACGTG ACCTATGACA TCCTGCAGTG CCCTGAGGAC TAGGGCTCCC GTCCTGCTTT 2160
GCAGTGCCAT GTAAATCCCC ACTGGGACCA ACCCTGGGGA AGGAGCCAGT TTGCCGGATA 2220
CAAACTGGTA TTCTGTTCTG GACGAAAGGG AGGAGTGGAG GTGGGCTGGG CCCTCTCTTC 2280
TCACCTTTGT TTTTTGTTGG AGTCTTTCTA ATAAACTTGG ATTCTCTAAC CTTT 2334

( 2 ) INFORMATION FOR SEQ ID NO: 34:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 18 amino acids
( B ) TYPE: amino acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: unknown

( i i ) MOLECULE TYPE: peptide

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 34:

Glu Ala Leu Met Tyr Pro Met Tyr Arg Phe Thr Glu Gly Pro Pro Leu
1               5                   10                  15

His Lys

We claim:

1. An isolated osteoclast-specific or -related DNA sequence, or its complementary sequence, the DNA sequence comprising a nucleic acid sequence selected from the group consisting of:

a) DNA sequences set forth in the group consisting of [illegible] SEQ ID NOS. 12, 14, 16 and 17, or their complementary strands; and

b) DNA sequences which hybridize under standard conditions to the DNA sequences defined in a).

2. A DNA construct capable of replicating, in a host cell, osteoclast-specific or -related DNA, said construct comprising:

a) a DNA sequence of claim 1; and

b) sequences, in addition to said DNA sequence, necessary for transforming or transfecting a host cell, and for replicating, in a host cell, said DNA sequence.

3. A DNA construct capable of replicating and expressing, in a host cell, osteoclast-specific or -related DNA, said construct comprising:

a) a DNA sequence of claim 2; and

b) sequences, in addition to said DNA sequence, necessary for transforming or transfecting a host cell, and for replicating and expressing, in a host cell, said DNA sequence.

4. A cell stably transformed or transfected with a DNA construct according to claim 3.

5. A cell stably transformed or transfected with a DNA construct according to claim 4.

5,552,281
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ABSTRACT 



The present invention provides amino acid sequences of 
peptides that are encoded by genes within the human 
genome, the kinase peptides of the present invention. The 
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and paralogs of 'the kinase peptides, and methods of iden- 
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CCCAGGGCGC 
TCCCGCGCCT 
TGAGGGGAGC 

CGGGACCATG 
GTGGGGACCA 
ACCTGGCACG 

CCCAATGTGC 

CCTGCTGACA 

GTATGGATCC 

GCCTCCGSAA 

GCTCATAGTG 

AGAAACGCAC 

GGAAACCCCT 

TGAGACGGTG 

GGCAGGTGTA 

CTCAACGTGA 

GGCCTTCTTC 

GACCAGCATT 

CTGGGGGAGC 

CACTGTGAGC 

GGCCCAQCCC 

GCCCCATTCC 

GAATGTTTAG 

GTGGGCGCAG 

ACTGTCTGTA 

CCCTGGCCTT 

TCCCTGGCAG 

CTGGACCTGC 

GTCACTAGTC 

AAGACTGATG 

TACTCCAGAT 

AGAGTCCCTT 

GCTTCTGTTT 

TGTGAGAGGA 

GCTGAGAACT 

GCACAGGAAG 

TCAGATCTTC 

GCCTAAAACA 

TTGTCACAGG 

CTTGGTCTTG 

TTAGGCAGCA 

AGATGCTGAG 

CCATGTTTGC 

CACATGTGCA 

AACTCTTCAT 

ACCATATTTA 

AAAAAAAAAA 



CGTAGGCGGT 
GAGGCGGCGG 
TGCTGTGTCC 
TCCGCGCTGG 
CATTGCTCCA 
GCTCTTGCTT 
TCAAGnrrcAT 

GAGTACATTG 

GTTCCCCTGG 

TGGACAAGAC 
GAAGAGAGGA 
CTTGCGCAAG 
ACTGGATGGC 
GATATCTTCT 
TGCAGATCCT 
AGCTTTTCTG 
CCGCTGGCCG 

CTCGAAATTG 
TGGGCATCCC 
ATGCAGTACG 
CCTGCAGGGG 
TGCTGTGAGC 
AAGCAGAACA 
CACCAGGGAA 
AATCCAATAC 
TQGGCCAGGA 
TGGATTGTGG 
TGGGCAGGAT 
CAGCTGGGTG 
GCTCAAAGGG 
CCTGTCCTTC 
MTATGTGGT 
ACCTGCTCAC 
AGCCTCCACC 
TACGGACAAC 
AGGCTGGGGG 
GCTTCTGTTA 
TTTTGCCTM 
CCTAGAGTCT 
GCTTCATGGC 
GCTTGGGCTG 
AGAGATAGCT 
TCTCCCAACT 
GGTACTGGAA 
CACAACTAGA 
ATAAATTGGG 
AAAAAAAAAA 



GCATCCCGTT CGCGCCTGGG 
CGGCAGGAGC TGAGGGGAGT 
CCCGCCTCCT CCTCCCCATT 
CGGGTGAAGA TGTCTGGAGG 
AGCCAGATAT GGTACAGGAC 
CCGGtGAAAG TGATGCGCAG 

TGCTGTGCTG TACAAGGATA 
AGGGGGGCAC ACTGAAGGAC 

CAGCAGAAGG TCAGGTTTGC 

TGTGGTGGTG GCAGACTTTG 
AAAGGGCCCC CATGGAGAAG 
AACGACCGCA AGAAGCGCTA 
CCCTGAGATG CTGAACGGAA 
CCTTTGGGAT CGTTCTCTGT 
GACTGCCTTC CCCGAACACT 
GGAGAAGTTT GTTCCCACAG 
CCATCTGCTG CAGACTGGAG 
GAGGACTCCT TTGAGGCCCT 
GCTGCCTGCA GAGCTGGAGG 
GCCTGACCCG GGACTCACCT 
GGTGTTCTAC AGCCAGCATT 
AGGGCCGTCC GGGCTTCCTG 
AACCATTCCT ATTACCTCCC 
ATGTATCTCC ACAGGTTCTG 
TTGCCTGAAA GCTGTGAAGA 
GGAATCTGTT ACTCGAATCC 
GAGGCTCTTG CTTACACTAA 
CCCAGGGTGA ACCTGCCTGT 
CAGGAGGACT TCAAGTGTGT 
TGTGAAAAAG TCAGTGATGC 
CTGGAGCAAG GTTGAGGGAG 
GGAACAGGCC AGGAGTTAGA 
TGGCTCTAGC CAGCCCAGGG 
TCATGTTTTC AAACTTAATA 
ATCCTTTCTG TCTGAAACAA 
ACTAGAAAGA GGCCCTGCCC 
CTCATACTCG GGTGGGCTCC 
AGCTCGATGG GTTCTGGAGG 
GAGGGAGGGG AGTGGGAGTC 
AACCACTGCT CACCCTTCAA 
GGMGAGGTG GTGGCAGAGT 
CCCTGAGCTG GGCCATCTCA 
CATTAGCTCC TGGGCAGCAT 
AACCTCCATC TTGGCTCCCA 
TTTGCCTCTT CTAAGTGTCT 
MTGGGTTTG GGGTATTAAA 
(SEQ ID NO:1)



GCTGTGGTCT 
TGTAGGGAAC 
TCCGCGCTCC 

TGTCCAGGCT 
TGTCAACGAA 
CCTGGACCAC 

AGAAGCTGAA 
TTTCTGCGCA 

CAAAGGAATC 

GGCTGTCACG 

GCCACCACCA 

CACGGTGGTG 
AGAGCTATGA 
GAGATCATTG 
GGACTTTGGC 
ATTGTCCCCC 
CCTGAGAGCA 
CTCCCTGTAC 
AG7TGGACCA 
CCCTAGCCCT 
GCCCCTCTGT 
TGGATTSGCG 

CAGGAGGCAA 
GGGCCTAGTT 
AGAAAAAAAC 
ACCCAGGAAC 
TCAGCGTGAC 
GAACTCTGAA 
GGACGAAAGA 
TCCCCCTTTC 
TAGGTTTTGA 
GAAAGGGCTG 
ACCACATCAA 
CTGGAGACTG 
ACAGTCACAA 
TCTAGAAAGC 
TTAGTCAGAT 
ACAGTGTGGC 
TCAGCAATCT 
CATGCCTGGT 
CTCAAAGCTG 

CTTCTACCTC 
CCTCCTGAGC 
GAGCTCTAGG 
ATGAGCTTGC 
AAAAAAAAAA 



FIG.1A 



U.S. Patent Jan. 22, 2002 Sheet 2 of 41 US 6,340,583 B1



FEATURES: 

5'UTR: 1-228

Start Codon: 229 

Stop Codon: 994 

3'UTR: 997 
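
The feature coordinates above fix the size of the open reading frame. As a quick consistency check, the sketch below derives the codon count implied by the start and stop positions, which can be compared with the 255-residue protein of SEQ ID NO: 2 shown in FIG.2A. It assumes the listed positions are 1-based and that each codon position refers to the first base of the codon; the function and variable names are illustrative and not part of the patent.

```python
# Coordinates as listed under FEATURES for the cDNA of SEQ ID NO:1.
# Assumption: positions are 1-based and name the first base of each codon.
START_CODON = 229
STOP_CODON = 994


def coding_codons(start: int, stop: int) -> int:
    """Count amino-acid codons from the start codon up to (not including) the stop codon."""
    span = stop - start
    if span % 3 != 0:
        raise ValueError("start and stop codons are not in the same reading frame")
    return span // 3


utr5_length = START_CODON - 1          # bases 1-228
protein_length = coding_codons(START_CODON, STOP_CODON)
utr3_start = STOP_CODON + 3            # first base after the stop codon

print(utr5_length, protein_length, utr3_start)
```

Under these assumptions the three listed coordinates are mutually consistent: the 3'UTR begins immediately after the stop codon, and the reading frame encodes the same number of residues as the protein in FIG.2A.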



Homologous proteins:
Top 10 BLAST Hits
                                                                          Score  E
CRA|1000682328847 /altid=gi|8051618 /def=ref|NP_057952.1| LIM d...        485    e-136
CRA|18000005015874 /altid=gi|5031869 /def=ref|NP_005560.1| LIM            485    e-136
CRA|88000001156379 /altid=gi|7434382 /def=pir||JC5814 LIM motif...        469    e-131
CRA|88000001156378 /altid=gi|7434381 /def=pir||JC5813 LIM motif...        469    e-131
CRA|18000005154371 /altid=gi|7428032 /def=pir||JE0240 LIM kinas...        469    e-131
CRA|18000005126937 /altid=gi|6754550 /def=ref|NP_034848.1| LIM            469    e-131
CRA|18000005127186 /altid=gi|2804562 /def=dbj|BAA24491.1| (AB00...        469    e-131
CRA|18000005127185 /altid=gi|2804553 /def=dbj|BAA24489.1| (AB00...        469    e-131
CRA|18000005004416 /altid=gi|2143830 /def=pir||I78847 LIM motif...        468    e-131
CRA|18000005004415 /altid=gi|1708825 /def=sp|P53670|LIK2_RAT LI...        468    e-131

BLAST dbEST hits:
                                                Score  E
gi|10950740 /dataset=dbest /taxon=96...         1049   0.0
gi|10156485 /dataset=dbest /taxon=96...         975    0.0
gi|5421647 /dataset=dbest /taxon=960...         952    0.0
gi|10895718 /dataset=dbest /taxon=96...         757    0.0
gi|13043102 /dataset=dbest /taxon=960...        714    0.0
gi|519615 /dataset=dbest /taxon=9606 /...       531    e-149
gi|11002869 /dataset=dbest /taxon=96...         511    e-143

EXPRESSION INFORMATION FOR MODULATORY USE:
library source:

From BLAST dbEST hits:
gi|10950740 teratocarcinoma
gi|10156485 ovary
gi|5421647 testis
gi|10895718 nervous_normal
gi|13043102 bladder
gi|519615 infant brain
gi|11002869 thyroid gland

From tissue screening panels:
Fetal whole brain

FIG.1B
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1 MVQDCQRNLA RLLLPVKVMR SLDHPNVLKF IGVLYKDKKL NLLTEYIEGG 
51 TLKDFLRSMD PFPWQQKVRF AKGIASGMDK TVVVADFGLS RLIVEERKRA 
101 PMEKATTKKR TLRKNDRKKR YTVVGNPYWM APEMLNGKSY DETVDIFSFG 
151 IVLCEIIGQV YADPDCLPRT LDFGLNVKLF WEKFVPTDCP PAFFPLAAIC 
201 CRLEPESRPA FSKLEDSFEA LSLYLGELGI PLPAELEELD HTVSMQYGLT 

251 RDSPP (SEQ ID NO: 2) 



FEATURES:

Functional domains and key regions:

[1] PDOC00004 PS00004 CAMP_PHOSPHO_SITE
cAMP- and cGMP-dependent protein kinase phosphorylation site

Number of matches: 2
1 108-111 KKRT
2 119-122 KRYT

[2] PDOC00005 PS00005 PKC_PHOSPHO_SITE
Protein kinase C phosphorylation site

Number of matches: 4
1 51-53 TLK
2 106-108 TTK
3 107-109 TKK
4 111-113 TLR

[3] PDOC00006 PS00006 CK2_PHOSPHO_SITE
Casein kinase II phosphorylation site

Number of matches: 4
1 51-54 TLKD
2 76-79 SGMD
3 139-142 SYDE
4 212-215 SKLE

[4] PDOC00008 PS00008 MYRISTYL
N-myristoylation site

Number of matches: 4
1 73-78 GIASGM

FIG.2A

2 77-82 GMDKTV
3 150-155 GIVLCE
4 158-163 GQVYAD
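
The site listings above can be reproduced by scanning the protein of SEQ ID NO: 2 with regular expressions. The sketch below is illustrative only: the four regexes are assumed translations of the corresponding PROSITE patterns (the figure gives only the accession numbers and the matches, not the patterns), and `scan` is a hypothetical helper, not part of any tool named in the figure. A lookahead is used so that overlapping sites such as TTK (106-108) and TKK (107-109) are both reported.

```python
import re

# Protein of SEQ ID NO:2, transcribed from FIG.2A (255 residues).
PROTEIN = (
    "MVQDCQRNLARLLLPVKVMRSLDHPNVLKFIGVLYKDKKLNLLTEYIEGG"
    "TLKDFLRSMDPFPWQQKVRFAKGIASGMDKTVVVADFGLSRLIVEERKRA"
    "PMEKATTKKRTLRKNDRKKRYTVVGNPYWMAPEMLNGKSYDETVDIFSFG"
    "IVLCEIIGQVYADPDCLPRTLDFGLNVKLFWEKFVPTDCPPAFFPLAAIC"
    "CRLEPESRPAFSKLEDSFEALSLYLGELGIPLPAELEELDHTVSMQYGLT"
    "RDSPP"
)

# Assumed regex forms of the PROSITE patterns PS00004, PS00005,
# PS00006 and PS00008 (not stated in the figure itself).
PATTERNS = {
    "CAMP_PHOSPHO_SITE": r"[RK]{2}.[ST]",
    "PKC_PHOSPHO_SITE": r"[ST].[RK]",
    "CK2_PHOSPHO_SITE": r"[ST].{2}[DE]",
    "MYRISTYL": r"G[^EDRKHPFYW].{2}[STAGCN][^P]",
}


def scan(sequence: str, pattern: str) -> list:
    """Return (start, end, match) tuples, 1-based inclusive, overlaps allowed."""
    hits = []
    # Wrapping the pattern in a lookahead lets matches overlap.
    for m in re.finditer(f"(?=({pattern}))", sequence):
        text = m.group(1)
        hits.append((m.start() + 1, m.start() + len(text), text))
    return hits


for name, pattern in PATTERNS.items():
    print(name, scan(PROTEIN, pattern))
```

With these assumed patterns the scan returns the same positions and residues as the figure, which also supports reading the garbled entries as `51-53 TLK` and `150-155 GIVLCE`.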

Membrane spanning structure and domains:

Helix  Begin  End  Score  Certainty
1      142    162  0.872  Putative
2      184    204  0.652  Putative

BLAST Alignment to Top Hit:

>CRA|1000682328847 /altid=gi|8051618 /def=ref|NP_057952.1| LIM domain kinase 2 isoform 2b [Homo sapiens] /org=Homo sapiens /taxon=9606 /dataset=nraa /length=617

Length = 617

Score = 485 bits (1235), Expect = e-136
Identities = 241/265 (90%), Positives = 241/265 (90%), Gaps = 22/265 (8%)

Q-y: 13 ~S«^ 1 

Sbjct: 353 LTCvSlS 

-fw„. GTASGM- — ----- .-DIOVWADFG^RLIVEERKRAPMEKATTKKR 110 

Query. 73 GIA5GM . D1 ^ w ^ F G^ RUVEERKR APhEKATTKKR 

Sbjct: 413 GIASGMAYWSMU^ 472 
Sbjct: 473 tSKSSemLN^TO 532 

Sbjct: 533 SgL^^ 592 

Query: 231 PLPAELEELDHTVSMQYGLTRDSPP 255
           PLPAELEELDHTVSMQYGLTRDSPP
Sbjct: 593 PLPAELEELDHTVSMQYGLTRDSPP 617 (SEQ ID NO: 4)

Hmmer search results (Pfam):

Model    Description                                  Score   E-value      N
PF00069  Eukaryotic protein kinase domain             100.1   [illegible]  2
CE00031  CE00031 VEGFR                                4.9     0.14         1
CE00204  CE00204 FIBROBLAST_GROWTH_RECEPTOR           4.7     1            1
CE00359  CE00359 bone_morphogenetic_protein_receptor  1.8     7.9          1
CE00022  CE00022 MAGUK_subfamily_d                    1.5     2.5          1
CE00287  CE00287 PTK_Eph_orphan_receptor              -48.4   3.8e-05      1
CE00292  CE00292 PTK_membrane_span                    -61.8   2.1e-05      1

FIG.2B

CE00291  CE00291 PTK_fgf_receptor      -113.0  0.027    1
CE00286  CE00286 PTK_EGF_receptor      -125.1  0.0021   1
CE00290  CE00290 PTK_Trk_family        -151.3  6.5e-05  1
CE00288  CE00288 PTK_Insulin_receptor  -210.4  0.014    1

Parsed for domains:

Model    seq-f  seq-t  hmm-f  hmm-t    score   E-value
PF00069  16     79     41     105      52.1    2.3e-13
CE00022  124    153    187    216 ..   1.5     2.5
PF00069  81     156    129    182 ..   48.0    3.1e-12
CE00031  129    156    1114   1141 ..  4.9     0.14
CE00204  129    156    705    732 ..   4.7     1
CE00359  79     157    28     356 ..   1.8     7.9
CE00290  9      218           282 []   -151.3  6.5e-05
CE00287  1      218           260 []   -48.4   3.8e-05
CE00291  1      218           285 []   -113.0  0.027
CE00292  1      218           288 []   -61.8   2.1e-05
CE00288  1      218           269 []   -210.4  0.014
CE00286  6      218           263 []   -125.1  0.0021

FIG.2C
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TCATCCTTGC 

I vr\ 1 w w 1 1 


JX 


AATGTATGT6 


XU J. 


GACTTGAACT 


XwlX 


CTGTAAAGAT 


CU X 


GATCCAGAAC 


251 


CACTGGGGAG 

\\f | %4w*^w^w »w* 


301 

w V J» 


AGCTGAGTAC 


351 

WW J» 


GGG6TGAAGG 


401 


GTGGATATGA 


451 


TTAACCCAAA 


501 

w V/ X 


TCTCAAGACT 


551 


AGTTATCTTG 


601 


TGGAAATAGA 


651 

w w X 


GTGGTCAGGT 


701 


AAGGTGTGGC 


751 

1 W i 


TCATAAACTG 


801 


CCAAAACTTG 

«V « www* ■ » 


851 

V w A 


CTAATTCATT 


901 


AGTGAGTCTC 


951 

w* 


CTCTGTGACC 


1001 

A w w* ^* 


GTCTGAGGAT 


1051 


CTATTACTGA 


1101 


TAAGGCCTCT 


1151 


AACTGTGTAC 


1201 


CAGGAGCTGT 


1251 


TCGGTGACfG 


1301 


AAAGCTGCAT 


1351 


GCTTTCTAGG 


1401 


TATTTCAAGA 


1451 


TGAGGCCTCT 


1501 


TGCTGGTCTC 


1551 


ATAGGTAAGT 


1601 


TCTACAAAGA 


1651 


TGGTAGAATG 


1701 


AATGCTAGAT 


1751 


TTAATTTTCC 


1801 


TATTTTGAGA 


1851 


CTTGGCATAT 


1901 


CCATTAC 1 1 1 


1951 


AATAACATCC 


2001 


GTGGATTTGC 


2051 


CATACAAAGA 


2101 

•* w> w w» 


CAACTGGTAC 


2151 


GTGAGCGGCG 


2201 


ATCAGTGGTG 


2251 


GCACATGCAA 


2301 


GGCTGATGAT 


2351 


CCCCAACCCC 


2401 


AAGGTTGAGG 


2451 


GTAAATGAGC 



GCAGGGGCCA 
CTGCTGAAGC 
TATTTAAAAG 
AACTAAAAGT 
AGGGGTGTCA 
TCTAACCCAG 
AGCAGATAGG 
ACAGGGAAGG 
GAGGAGAGAG 
GCAGGTACTA 
GGAGAAAGCA 
AATATCCTCA 
AATCATTCAG 
TACAGACTTA 
TCTAAAATAA 
ACCCTGAAAG 
GAAGATTCAT 
GTCTTGTCCA 
TCAGACTTTC 
ACTGTCACCC 
TTTAGACCCA 
GGCATGAGAA 
TGCCAGCTTG 
ATTGGCTGCT 
GCTTTGTCTA 
TCACTGTGCC 
ACACAGAGCG 
GCTGAGTCAG 
A GTTGA CATC 
CTTTTTGGAG 
TGAGCTCACA 
CHCCCTATC 
CTTTCCTGGT 
GAAAGAGCAG 
CCACCACTTA 
TATCCATTAA 
ATTAAATGAG 
AGGAGGCATT 
TCT7TTTCTG 
AAGTGAAAAA 

TTCTTAAGGA 
GCTTTGTATC 

GGGTCTGTAG 

GGCCAGTGAG 

GCATTAGATT 
GGGATCTGGG 
CTGAGTTGGA 
CAGCCTAGGG 
ACTGCTGATC 
TCTTGGATTA 



TGCTMCCTT 
GAGAGTACCA 
ATAAGGAGGA 
GCACTTCTTC 
TACCGAGTAG 
AGCT6AGATA 
GAAAAGAAGC 
GCTAGAGAGA 
TAGAGGGTCT 
AAGTATGTGT 

GGGCAAGCTC 
TGGTGGAAAG 
AGCCAAGAGA 
ATTTTGGGTT 
TGAGATGTGC 
CTCTTACATG 
TTGGATGTTT 
CTGTCCGTAA 
TGCCTTGGAG 
TAAAACCAAA 
GGAAGAATGA 
GAGTGGAATG 
TTTAACTCTT 
CCAGAATGAT 
TTCATGGCCC 
TGTGGCAGTC 

CACAAGGGAG 
GTACCACAGC 
TGAGCCATAC 
GAACATGGAC 
CAACCCTTCA 
TTGCAAGGCT 
TCCCCTCATT 
AAGCTTTAGA 
GCTAGTCAAC 
GTGAATATAA 
A7TAGGTCTA 
CATTAAATAT 
AACTAAAATA 
TCAACAACAT 

GCAGAGATTA 
TAGAACAGGG 

CCTGTTAGGA 

CATTGCTGCC 

CTCATAGGAG 
TTGCATGCTC 
ACAGTTTGAT 
TCCGTGGAAA 
TAGAGGACCA 
GGTGATGGAA 



CTGTGTCTCA 
GAGGTTTTTT 
GCCAGTGAGG 
TAAGAAGTAA 
CCCAGCCTn 
GCTTGCAGTG 
CAAAAATCTG 
CATTTGGAAA 
TGATTTCGGG 

TGATTGAATG 
TGGAGGGTAT 
TCCTGATCCT 
TTGAATTGTT 
AAAAAGTAAA 
TGGGGGTGGG 
TAAGAGTTCC 
GTGTTCATTA 
CCCAACCTGG 
TTTGTGAGAG 
AAGGCCCCTC 
GTGATGGGCA 
GGTGGGTTGA 
CTCTGGGGAA 
GTTGAGCAAT 
CTGTGCCTGT 
TGTAGTTACC 
TCTTGTAACA 

TTGATCTCAG 
CAGGAGTATT 
CGACTCTGTG 
CCCTCCTTTC 
CAGCTCAAGT 
GGAGTGAACA 
ATGAGCCAGA 
CCTGCCCCCT 
TAATACCTGT 
TGAAAGCACC 
TTGTTCTTCC 
ATACTTGGTT 
GAAAGAGCAG 
TGTAATCTAA 
GTCCCCAGCC 

ACCAGGCTGC 

TGAGCTCTGC 

TGTGAACCCT 

CTTATGA6AA 
ACCAAAACCA 
AATTGGCCCC 
ATTTATTCAA 
AAATCTGAAA 



GTCCAATTTT 
TGATGGCAGT 
GAGAGGGGTG 
GATGGAATGG 
GTTCCGTGGA 
TGGATGAGCC 
AAGTAGGGCT 
GTGAAACCAG 
TCTTTCATGC 
TCTTTGGGTT 
GGCAATAACA 
GTTTGAATTT 
GAGTAAGTGG 
AACAAGAAAC 
GCATGGCAGC 
AAAAATATTT 
AAATCTCTCA 
GATTGGTTTG 
AGATGGCATA 
TTGACAAGGA 
TATATATATC 
GGTGGTGTTT 
CGAGGGGGAC 
CTTGAAGTGC 
GAAACAGGGT 
CAGAGAGAAC 
ACCTTGTCCT 
CTGTCCTCTT 
GTATTTTGTT 
CTTTTGTCTA 
TCAGCCAGTG 
GTCAGCTTCC 
AGAGTTGACA 
CCTGAGTATG 
GCCTCAAG7T 
GTCACAGGAT 
TAGCAGAGTT 
CCTTTTATAC 
CTATCTCTGA 
TTCTTTTCCA 

CAGCCTCCAA 
CCTGGACCGC 

ACAGCAGGAG 

CTCCTGTCAG 

ATTGTGAACT 

TCTCACTAAT 
TCCCCCCGCC 
TGGTGCCAAA 
TGTTGGTTGA 
AAACAGGGCT 



FIG. 3-1 
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250 
255 
260 
265 
270 

275 
280 
285 
290 
295 
300 
305 
310 
315 

320 

325 

330 
335 
340 
345 
350 
355 
360 
365 
370 
375 
380 
385 
390 
395 
400 
405 
410 
415 
420 
425 
430 
435 
440 
445 
450 
455 
460 
465 
470 
475 
480 
485 
490 
495 



TTTGAGGAAT 
TGGCTGTTGG 
AGATTGTGTT 
CTAGAAAAAG 
TTTTTAGAAA 

CTGGAGAGAT 
GGCAGAAACT 
ATGCTGTTAA 
CTCCTGCTAC 
ATGTCTGGAG 
TGGTACAGGA 
GGGCCTATCC 
ATGTTCTGAT 
CTTAGGGTGG 

M6TGTGGCC 

CAGCTGGAGG 

CATATCCCTC 

TTCCTGTGGC 
TTCTTATTTT 
AGGGATTCTT 
CACTGCAATC 
CTGAGTAGCT 
TATTTTAGTG 
CTCCTGACCT 
TACAGGCGTG 
GCAGAGCCAG 
TTTGTTACTA 
TAnATTATT 
TACAGTGGTG 
GCAGTTCTCC 
CACCACACCC 
TGTTGACCAG 
ACAGGCGTGA 
CAGAGCCAGC 
TGTTACTAGC 
TGAGACAGAG 
TTGGCTCACT 
AGCCCCCCTA 
TTTTTGTATT 
CTCAAACTCC 
TGGGATTACA 
TGTAGGCAGC 
TTCCTGCTGT 
ATCTTGTTGA 
TAGACACCCA 
AAGGGTGGGA 
TTGTTGGTGG 
TGGCCCCCAA 
GATAGACTAG 
CTCTTGTTGT 



AGGAAAAGGC 
CTGGGAATAG 
CGTTTCTTCT 
GCAAGTTTTG 
ATGGAATGAG 

GAGCAGAGGT 
GTGCATCTAG 
CTCAGTCTTA 
TTTGGGCCTC 
GTGTCCAGGC 
CTGTCAACGA 
TCCCATCTTT 

GGAAAACACA 
GGAAAGGAAT 

CCCCTGAACC 

CAGGGTGGGG 

ACTTTATGGG 
TGCACTACAG 
ATTTTATTTT 
GCTGTTGCCC 
TCTGCCTGCT 
GAGATGACAG 
GAGACGGGGG 
CAAATGATGC 
AACCACTGTG 
CTGTTCCTTC 
GCTTTTATTA 
ATTATTGAGA 
CGATCCCGGG 
TGCCTCAGCC 
GGCTAATTTT 
GCTGGTCTGG 
ACCACTGCGC 
TCnCCTCAC 
TTTATTATAG 
TCTCGCTCTG 
GCAACCTCTG 

GTAGGTGGGA 
TTTAGTAGAG 
TGACCTCAGG 
GGCATGAGCC 
TCAGTTTCTT 
CTGAGGCTCA 
GATTGAATGA 
GTGAATGGTT 
ACTTGTCTTT 
AAAGAAGAGC 
GGTCTGAAGT 
GCAGGCACCT 
AAAACATCCC 



AGTAACATGT 
TCATAGGAAG 
TCTCAGAGCT 
TTTCAGTAGA 
ACTACTTTTG 
TGGACAAGTG 
CAGAGCATTG 
TTCTACATGG 
TCAACCTCTT 
TGTGGGGACC 
AACCTGGCAC 
ACCAGTGTAC 
GAAACAAGCT 
GTACCAAGGA 

CAGGTTAAAT 

GGATGAGAGG 

TGAGGAAACT 

ATTATGCAGG 

ATTTTATTTT 
AGGCTGGAGT 
GGGTTCAAGT 
GCACCTGCCA 
TTTCAACATG 
ACCCACCTCG 
CCCAGCCAAG 
ACCACAGGAT 
TAGCTATATT 
CAGAGTCTCG 
CTCACTGCAA 
CCCCGAGTAG 
TGTATTTTTA 
AGCTCCTGAC 
CCAGCCAAGA 
CACAGGTTGC 
CTACATTATT 
TCGCCCAGGC 
CCCCCCGAGT 
CTCCAGGCAC 
GCGGGGTTTC 
TGATCCGCCT 
ACCGGGCCCT 
AAAAATTATA 
GTTTCTTCAT 
AATAATATAT 
ATTCCTTCCT 
ATATTCTTCA 
AGTCCACTCC 
GGTAGGGCTG 
TGTGCTGTAG 
TGTGCTTATA 



TTAACCCAGA 
GGCTGACACT 
ATAAGCAAAG 

AAAAAGGATA 
AGGCCATGAG 

CTTACCAGAG 
GCCTAACCCT 
TAGGAATCCT 
GGTTTTGTGT 

ACATTGCTCC 
GGCTCTTGCT 
TATGGGCCAA 
TCTGAGTTGA 
AGAGCTCATG 

TGGAAGAGCC 

AGCCCTTTCC 

GAGGCCCAGG 

TACTTCAAGA 

ATTTTATTTT 
GCAGTGGTGC 
GATTTTTCTG 
CCATGCGCAG 
TTGGTCAGGC 
ACCTCCCAAA 
AGTTGTTTTT 
GCCTCCCTAG 
ATTATTATTA 
CTCTGTCGCC 
CCTCTGCCTC 
GTGGGACTAC 
GTAGAGACGG 
CTCAGGTAAG 
GTTGTTTTTA 
CTCCCTAGGT 
ATTATTATTG 
TGGTGTACAG 
TCAAGCAATT 
CTGCCACCAC 
ACCTTGTTGG 
GCCTCGGCCT 
GCCTATAGCT 
CAGACTTCAA 
CTGGAAAATG 
GCAGTGTATC 
CCCATCGGAT 
CAACGTAAAA 
AGAGGCTGGA 
TGCCTATATC 
ATTCCAGCTC 
CCAAGTAATT 




GAGAAGTTTC 
GAAAAGAAGG 
GCTGAAAGTT 
ATCAGAACCA 
TTCCTTGTCC 

ATCTTGTGGA 
TTCAAATGAG 
GTCCCTTTGC 
GCAGGTGAAG 
AAGCCAGATA 
TCCGGTAGGT 
GCACTATTTC 
GAATTTCAAT 
ACCAAACCTC 

ATAAATGGGC 

AGGGTTGTCC 

AAGAGTGACT 
GTTGTTTGTA 
ATTTTATGAG 
AATCTCGGCT 
CCT TAGCTTC 
CTAATTTTTG 
TGGTCTTGAA 
GTGCTGGAAT 
AGTGTGGTTG 
GTTCCTACTT 
TTATTATTAT 
CAGGCTGGTG 
CCGAGTTCAA 
AGGCGCCTGC 
GGTTTCACCT 
TGCTAGAATC 
GTGTGGTTGG 
TCCTACTTTT 
TTATTATTAT 
TGATGTGATC 
CTCCTGCTTC 
GCCCAGCTAA 
CCAGGCTGGT 
CCCAAAATGT 
ACATTATTTT 
ATCAGATTTG 
GATGGTAATA 
CAGTACATGG 
TGGAATTCTC 
TAGTTGAAAT 
TGGGCATGCC 
CTGAGAATGA 
CTGCACATAG 
GAGTTGACCT 



FIG. 3-2 
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TTAAACACTT 
CGTCTGGCCT 
TTTCAGCTAT 
TTGGTTCTTG 
TCCATTTAGT 
TAGTTAAAGC 
TAACGTCAAA 
ACCCATATGT 
ATCCAAAATA 
GATTCTCTTA 
ATGGTATAAC 
GCTCAGGCCA 
ATTTGTAGGT 
TGTGTGTGTG 
TGTACACAAA 
AGTACAGGCA 
GGGAGGCCAA 
TGACCAACAT 
TTGGCATGGT 
AGGAGAATCG 
TGCCATTACA 
AACAACCACC 
TCCTGGCTTT 
AGGCACCAAT 
TGTTAGGGAG 
TCTGACATAT 
GTAATCCTAG 
GAGCCCAGGA 
TACAAAAAAT 
CCCAGCTACT 
TCAAGGCTGT 
GGACAGAGTG 
CTTAATAATC 

TATACACCTA 
GATGTGGAGC 
CAAGCCAAAT 
CCTAGTTGCA 
AAGGAGCACA 
GATGTCACGG 
GACTGTCTTT 
TCAGGGCTAG 
TTGAGGAACA 
GCCAGCTTGC 
GGCCTGCTGG 
CTTCAAGGCC 

MTAAAGGAA 
CTAGTCACAG 
TCCTAAACTC 
TGCAGCCACA 
GGTGCCCAGG 



GCCTCTTCCC 
CTGGAAGAGT 
AACTCAGAGC 
CCCCTTTTAC 
ATGACAGGAG 
CTTTGGCTGG 
ACCCAATGAG 
TCATATTCTT 
AGCAGGACAG 
GAGCTAAAAA 
CCATTCATAT 
TGGATGACAA 
CTAGTCAGAT 
TGTGTGTGTG 
TACATAAAGA 
GGCCAGGCGT 
GGCAGGTGGA 
GGTGAAACCC 
GGCACATGCC 
CTTGAATCCG 
GTCTAGCCTG 
ACCAAGAGTA 
GCAATTTATT 
CTGTAAAATG 
GATTAAATGT 
AGAAAACTCT 
CACTCTGGGA 
GTTTGAGACC 
ACAAAAATAT 
TGGGAAGCTG 
AGTGAGCTGT 
AAACCCCTGT 
AGTAACTGTC 
TATGTATACA 
CCCAGGGATT 
ATCATTCCCG 
GCTTTACCTT 
TCTCCTGACT 
CGCCCTCAGA 
TGGATCTAAT 
GAAGGATTGT 
TGGGGCATCA 
CAGTGTTTCT 
GAGGAGGACT 
CCCTTCCAGC 
ATGACTTTTC 
GGTGGGGCTG 
CTCCCCTCAT 
TCCATAATCC 
ACAACAGGAA 



TGGGAACCAT 
TGGAAAGCAG 
TCTCAAGTCT 
TCCCAGGGAA 
CAGAGAATGT 
TCCTTTCATT 
TTCACAGATT 
GCTGTTTTCC 
GGTAGAGCAA 
ACTTCAGAAC 
CACAGATGAG 
GAGCTGGCCC 
GCTAGCTTGT 
TGTGTGAGAT 
GGAAGTAGAC 
GGTGGCTCAC 
TCACCTGAGG 
CATCTCTACT 
TGTAATCCCA 
GGAAGCAGAA 
GGCAACAAGA 
CAGGCTATGG 
AACTAGCCTT 
AGGATMGAA 
GATAACCTAT 
TAATAGGGCC 
GGCCGAGGCA 
AGCCTGGCCA 
TAGCCAGGCG 
AGGAGCGATG 
GATCATGCCA 
CTCAAAACAA 
ACTTTATATT 
TTTCTCTTAT 
AAGGGCAACT 
TGGAGGAAGT 
GAGGACAGAG 
TCTGAGCTTT 
TAGAGCCTGG 
TTGACTTTTG 
ATTTGTCTGA 
AGCTGAATGG 
CTGATGAATT 
CTCCCTCTGT 
CTTGCTCTTA 
TTCTCCCCTT 
GATATTGAAT 
CTCTCCCTTA 
TGCCTCTGTT 

GCTACTTAAA 



ATAGGGGATT 
CCATCATTAT 
TTTCTGTGGA 
GTTGATTCTG 
CAGAGCTGTA 
TTATAGCTGG 
GGGTCTCGCC 
TATGTGTATG 
GTTAATCTTT 
TAGAAGAAAC 
GCCTGAAACC 
TAGCACTGAA 
TAGCTCTGTG 
AGAGACAGAA 
ACGTTAGCAT 
GCCTGTAATC 
TCAGGAATTC 
AAATACAGAA 
GCTACTTGGG 
GTTGCAGTGA 
GGGAAACTCC 
AATGAGACTA 
AAGTGACTTC 
TATTACTCAT 
ATAAAGTGGC 
GGAGGTGGTG 
GAAGGATCGC 
ACATGGCAAA 
TGAtGGCACA 
ATTACCTGAG 
CTGTACTCCA 
AACAAATGAA 
ATGTTGTGAG 
TACACATTCA 
TTGAACTACC 
AGAGTATCTA 
ACTCTAATCC 
CCCCTGGTAA 
TAATTTGCCC 
CCCCAGTTGG 
CCCGAGAGAT 
TCTTGTAAGA 
TAGAGTACCT 
GCTACTCAGA 
CCCAGCTGGG 
CCCCCAGTAC 
GGAGAAATTG 
CATTACCCCA 
AGCCTTCCGA 
GCTGGAACCT 



GGCCTGGAGA 
TATCCTTTCC 
TCTTATTGCC 
TCTTTTCTGT 
AGGGACCTTA 
GACTAATAAG 
TTGGCATGTA 
AATATTTTCT 
GGAATTTCTG 
CACCCACTAT 
AAAAAGACTT 
CTCTTGGGTC 
CGTGCGTGTG 
AGATAACATA 
GGTAGATAAG 
CCAGCACTTT 
GAGACCAGCC 
AAAAATTAGC 
AAGCTGAAGC 
GCCGAGATTG 
ATCGCAAAAA 
TGGTTTTAAA 
CCTGAGCTTC 
GCCAC ATGGT 
TAGCATAGCA 
GCTTATGCCT 
TTGAGCCCAT 
ACTCCACCTC 
CACCTGTAGT 
CCCAGGGATA 
TCCAGCTGGG 
AAAAAAAACC 
TGTGTGTCTA 
TTGGTGATCT 
CTGACACAAT 
GGTTCTGTCT 
AGCTGTGCTG 
ATTCAAACTG 
TGGGGAGAGT 
AGGAAAATa 
AACCTGGGTT 
TCTCTCCCAC 
GAGTAGTGCA 
GAAATTCATT 
CTACAGTTAC 
CI UGH HC 
CTGGGGTCCA 
TTCTTCTGTC 
CAGACCCTCA 
CAGACTGTGC 



FIG. 3-3 
MTGGAGGCC
GTGCGATTAG 
TGACTAGTCC 

TCTTTTCTTT 
GCTAGTGAAG 
CTGGTAGTGA 
TAGGCCTTTT 
TGCCATTAAT 
GGCTCTCTCT 
TTGTAGCTGA 
GCATGTTAGC 
TCACCCTGAT 
CTATTAGTCT 
CTAGCAGCAT 
CTTTCAGGAC 
TCACTTTTGA 
GCAGATAGAA 
MTTTAGTGA 
AGAGACTGTG 
CTGAAAGCCA 
GGTAGAATCA 
CAGACACCCT 
TGACTTAATA 
TGGAGGTTAC 
TTTGAACTCA 
TTGAATGCAA 
ATAATATGGG 
CAGATAACAT 
TCAACAAAAG 
AGTCAGTGGA 
CACACTTACT 
GTGTTTACTC 
ATCATTAAGG 
AAGGAAGGGC 
CCTGACCACA 
AATGTGCCTT 
ACAGATGTTT 
GATGAGGCCA 
CCTCACTTAG
CCTTTTTTTG
GGTACAATCA 
TCCCACCTCA 

TTGCCATTTT 
TTGCCCAGGC 
AAGTGCTTGG 

CCATTTTATA 
CAGGGTCACA 

AGTCTGCTTT 
AGACTTGGAG 
TGTGTAACTG 



AGTGACAAAA 
GCAGCTGGCC 
TCCCAAGCCT
TTTTTCTTTC 
TGAAATTGTG 
TCAATTACTT 
TTAAAACTCT 
TCTGTATCTG 
CCTTAGGAGA 
GGTTAAAGTG 
AAGCCAGAGG 
CAAAGGTGTT 
TGATGGCCCC 
TATCAGAAGG 
TTCCTTTCTC 
AAAGCGGTTA 
GACTGTGGTC 
CTCCAGGAAG 
TTGTAATATG 
CATTCATTTT 
TAATTACAGT 
CCTGAGCATA 
AAATGTAGTA 
TTGTTTAAAG 
GGCATTCTTA 
GCATATTTCT 
GTAAAGAGCC 
TGAAGGGTGT 
TGCTGTTAAC 
GCAACCCTGC 
AACCTAAACC 
ATTTGTCCAG 
CTGGCTATTG 
AGACATCTGG 
GAGTGGTACT 
AATCCTGCTC 
CACAGCTTCT 
GAGATAGATA 
ATTATCTGAT 
AGAACAGGGT 
TAGTTCACTG 
GCCTTGCAAG 
TTTTTATTTT 
TGGTCTTGAA 
GATTACGGAA 

CTAAAACAGG 
CAGATGATAT 
CCACTAGGAC 
ACAGAAAATC 
TGGGCAAGTT 



CTGAAAGTAG 
AGAATCTTTT 
TCCCAACAGG 
TTTCTTTCTT 
GGAGTGGAAA 
GTAAACACTA 
GAGTTACCTC 
GGGCAATCCT 
GGCCAGGAGA 
TGGAGCTATC 
ACCTTGACAA 
TGGCTTAGGA 
AGCGTGGGTC 
AAAATCCACC 
AGGATTGCAA 
CTAATACCTA 
ACTGCATCAG 
GCCAGTGAAG 
TTGGCTGACA 
CTCTCCCCTC 
AATAGGTACC 
CGACATGCAT 
CTAGTCTTAC 
TCACAGAGCT 
CTCCTTGCCT 
TAACCTCACT 
CTCACCCTGC 
TAGTTTAAAG 
TTTCTTCTGG 
CATCTGCTGT 
TTTGATTCTG 
TTTATCTTTT 
GACAGGGGGC 
TTCTTCCTCT 
CCTAGGATGT 
TTTACTTTGA 
ATAGGAGGCA 
ACTGATATTA 
TCAATCTTCA 
CTTGCTCTGT 
CAGTGTCAAC 
CAGCTTGGAC 
AAGTAGAAAC 
CTCCAGCGAT 
GTAAGCCACT 

AAGGCCCAGA 
TTGAACTCAG 

TCCCAGGAGA 
TGATTTGAGT 
CCTTAGCCCC 



CTCTGTCAGT 
GGATCTCCTG 
CCTCTTTTTT 
TCTTTTTTTT
AGGAACAAAG 
TTGTACTTGG 
TCTTTCCTTT 
TTCTGATGTT 
GTAGCCAGAG 
AATGGTGACC 
CTTTTTTGAT 
GGAGGGAAGA 
TCTATTGCTT 
GCTCTTMGG 
ACATAAGACT 
TACTCTGGGA 
GCAACAGACC 
AAATAACACA 
GCAGGGTACT 
ATCCCCATCT 
ACTTATTGAG 
AGCACATTTA 
CTACTTCGAG 
MTAGGTAGC 
GCAAGAGTCT 
GAGGCTCAGT 
CTGCCACACA 
GCTTCATGGA 
GTCTCAGGCT 
TATGCTGTTG 
GCTGTGGCCT 
AGGAAACAGC 
TGGGGCCTGC 
GCCCCTACAA 
AGCAGCAGCA 

GAAGAGAGAA 
GAGGTAGAAA 
ATTAAACGTT 
TAATAACCCT 
TGTCCAGGCT 
CTCCTGAGCT 
TACAGGCGTG 
AAGGTCTTAT 
CCTCCTGCCC 
GTGCCTGGCC 

AAGGTTTGGA 
GTCTCCCTGG 

AAAAAAAAAA 
CTTAGTTGAG 
TGTGAGCCTC 



AATTGTGCTG 

GACATATGGC 

TTCCTTTTTT

II 1 1 1 1 MAG 

AAATCGGTAA 

ACCAGCCCAG

CCTTGAGCAG 

CTCTGGACCT 

AGCATGTCAT 

TGGCCTCTTG 

GATTGTCCGT 

AAAGCTACCC 

GACCTGGTTC 

CTCCTGGGAA 

ATTTGAGCTT 

AAGGGCTAAT 

ATTTCCGCTA 

CGTAGCAACC 

TTCTGTGATG 

AAGCAAGCCT 

TACTCTGTGC 

ATCCTTACAA 

AATAGGGAAA 

ATAGCTGAGA 

CTTGGCATTC 

TTCCTCTTAT 

CTGGTAGTGT 

CTCTATAATG 

CCTGATGTAG 

ATGTTGCTGC 

TCTCCAGAAG 

CAGCCCGTAG 

CTGACAGAGG 

GAGACTCCAG 

TATGAGCTTG 

CTAAGGACCC 

AATGGAGAGA 

GTATTAAGAA 

GCAACCCCCA 

ACAGTGCACT 

CAAGCAATCC 

CCACCACACC 

TAATACTATG 

CAGCCTCCCA 

AGTGCAACCC 

GTAACTTGTC 
CTCCCAAGAG 

AAAAAACAGT 
CTAGGCTAAC 
AGTTTCTTAT 



FIG. 3-4 
CTGTAAAATG
AAGGACTCTG
ACATGGCAAC 
CATTTGCTAT 
GTTGGCATAC 
TTTTTGAGAT 
ATGATAGCTT 
GTAGCTGGGA 
ATTTTTTGTA 
TCCTGGGCTC 
AGGTATGAGC 
AGCTGGAAAT 
TGGCTTTCAT 
CGAATAACAG 
TTATATGACC 
AAAGTTCGGA 
CCACCTGACC 
GTCTATGAAA 
TTCAATCATG 
CTTAACAGAT 
CTCTTCCCTT 
CTCTCTGGTA 

TTACCATTCC 

CTCATGACTT 

TATCTCCAGC 

ATCAACCTGG 
TCTCTAGACT 
TCCACTGAAA 
TAAGCCAGAA 
ATCCAAGCCT 
CTTTCCTTTT 
TGGATGTCTG 
CTGTTCTCTA 

TTTGTTTTAG
CACAATTTCG 

CTGCCTCAGC 

AGCTAATTTT 

GCTAGTCTCG 

AAGTGCTGGG 

TCTTAAAAM 

GTCTCTCCTA 

GACCAAAATC 

AGCACTTTGG 

CCATCCTGGC 

TAGCTGGTCG 

AGGCAGGAGA 

ATCACGCCAC 

AAAAAAAAAA 

CGTAATGTGG 

CAGCCTCACC 



TCATAAAAGA AATCCATCTC ATGGAGTAGT TGTGATGATC 
AAAACATTAG AATGGnTAA TGTGAAGGAT TAGCAGCAGC 
ATTGTGCATC TTATATTAAC TATCCAAATA TATCAAGCGT 
ATATAAAAGT CATCAAATTA GGCACTGTGG GGGATACGGA 
TAGCCTGGCC TCTTAATTAA TTCATTAATT AGCTTATTTA 
AGGTCTTGCT CTATTGCCCA GGCTGGAGTG CAGTGGCATG 
ACTATAGCCT CAATCTCCCA GGCTTAAACA ATCCTCCTGA 
CTACAGGCAC ACACTACCAT GCCCAGCTAA iiiiiiniA 
GAGACAGGGT CTTGCTCTGT TGCCCAGGCT GGTCTCAAAC 
GAGATCCTCC CACCTGGGCC TCACAAAGTG TTGGGATTAC 
CACGGCACCT GGCCTGGTCT CTTAACTGGT TCCCTAAGAC 
AGAGAATGTC ATGGAGCATT CCTAACCATG GGCTCCAGCC 
TCTGTTTCTC CCCTGAAACA ACATTCCTTT AGTAATATTC 
CTTCATCAGT CTGTCTACCG ACCACTCTTC AGGCTTCATC 
TCCCAAACTG CACTAAGGGT TGTATTAGAG AAAAGTGGAT 
GTCAGGCTGC TTGAGCTTAA ATGCCAGCTT CACTTACCAG 
ATGAGTCAGC TGCTTAACCA TTCTTTGCCA CAGTTTCCTT 
AGGGAAATGG CTCCCACCTC AAAAAGTTGT TAACATTAAA 
TATTCAAAGT CCTGAGCAGA ATGTCTGGCC ATGACTGGGA 

gttagcattt attattagta tctgtcagtc ttcaaatgtt 
gqctttchg acahccaca ctctcctggt tttctcttac 

ATACCTGTTT GCTTATCCTT CTTTGTCCAG GTCTGGGATG 

TTCAGGCGTG CTGTTTTCTC CTTAGGCAGT CTTACACACA 

CCTTCCATTG TCCTCCACAC ACTGATGACC CTAAAATCAG

CTAAACCTTT CCACTGAGTT CTAGACCCAT ATGTTGTACT 

CTTGTCCATT TGAATGTCTT CCAGGCACTT CAGACTCTCT
TTGCTGGACT nCACTCHC CCCCTAAAAC TGGCTCCTCT 
CATGTATGTC ATTGAGAGGC ACCACCATCC ACCCAGTGCC 
ACCTAGGAAT CCTTGATACC TGTTCTCTCT CATCCTGCAT 
ATCAGTTTTA TCTCTAAATT ATATTTTGGT AGGTTTACTT 
CTCCCACCAC CACCCTGCTC CAAGCTACCA TCATCTCACC 
CAATAGCCTC ATCTCCCACA GCCACTCTGC ACCCCCTAAT
TAGAGCAGTT GGAAGGAGTG ATTTTTGTTG TTTGTTTTGT 
ACAGAGTCTC ACTCTGTTCC CCAAGGCTGG AGTGCAGTGG 
GCTCACTGCA ACTTCTGCCT CCCGGGTTTA AGCAATTCTC 
CTCCCAAGTA GCTGGGATTA AGGCACCGGC CCCCATAGCC 
TATATTTTTA GTAGAGATGG GGTTTTGCCA TGTTGGCCAA 
AACTCCTGAC CTCAAGTGAT CCACCTGCCT CGGCCTCCCA 
ATTACAGGTG TGAGCCACTG CACCTGGCTG GAAGGAGTGA 
AAAAAAAACA AAAAAAAACT TGACTGTGTC ACTCTGTGTT 
CCTTGTATAC TTCCACAACT TCCCAGTGTT CTTGGATAAA 
CTTAACTTGG CCAGGCGCGG TGGCTCACAC CTATCATCTC 
GAGGCCGAGG CAGGCAGATC ATGAAGTCAA GAGATTGAGA 
CAACATGGTG AAACCCCATC TCTACTAAAA ATACAAAAAT 
TGGTGGCGTG TGCCTGTAGT CCCAGCTACT TgGAGGCTG 
ATCACTTGAA CCTGGGAGGC AGAGGTTGCA GTGAGCCCAG 
TGCACTCCAG CCTGGTGACA GAGTMGACT CCATCTCAAA 
AAAAAAAAAA TTCCTTAATT TGGCCTACAG TAGAGCCCTC 
CCTCTCTCCA CATCTCCACA ACCTCCTGCT CCCTGCACTT 
TCTCTTCTGG ACAGGCCCTC CTTCTGACAA GGGCTTTGTT 

FIG. 3-5
CATTCTGCTC

CTGCTTATCG

MTTCCTCAC 

ACCTTTCCTT 

TTCATGTGAT 

TTGGTCATGG 

TATTGTTGGC 

AAGAGATATG 

CCCACATCCT 

AAGACGGAAT 

GTGTCAGGGT 

CCCATAGTGA 

AGCAGCAGGT 

CAAGCCCTAG 

AGCTACTCTA 

TTTCAATTCC 

GTATATTTGA 

GAAAAAATGG 

TGGGGAGTGG 

AGTTTGTCCT 

GTATTGTGTT 

AAATCCTGGT 

CCCTTTGTGT 

AGTGGCTCAT 

TCACCTGAGG 

TGTCTCTACA 

AATCCCAGCT 

GGCAGAGGTT 

GACAAGAGCT 

TACAGGCTGG 

CTAGGCGGGA 

ACAGAGCAAG 

GTGGCTCATG 

TGCTTGAGCC 

CATGCCAGCC 

AAAGAAACGA 

GGAGGCCAAG 

GGCCAACATG 

GCATGGTGGC 

ACGAGAATCG 

TGTCACTGCA 

TAAATAAACA 

TGGAGGCCAG 

AGGCCGAGGG 

MCACAGTGA 

TGGTGGCAGG 

ATGGCGTGAA 

TGCAGTCCAG 

AAAAATGGAG 

GGAGGTCGAG 



CCTCTGCCTA GAATGCCCCC TTACTCTGTT 
TTTAGATCTT TACCTGGATG GCTCAGAGM 
CCTGAAAAAT AGGTTAGGTC CCTGTtTTAT 
TGAGGCTTTT TTTAAAAAAG TAGTTTTAAT 
CAtCTCCTTA ATGATATCTT AAGACCTCTA 
ACTGTGGGGT TTTTGCCCCT CATTGTGTCA 

ATAGGAGGGA TATTTGTTGA ATGAATTGCT 
ATGTAAGTCA GGCTTTTCCC TGCCCTTCCC 
TCCTATAGCA GCCACCGTGG CTGCAGTTAC 
CAGTTCCGGA CATTGGGTTG TTTTAGAAAA 
GATAAGTTAA AGCTTTGTCT TTTGCCCTCA 
GTAGAAGCCA GAGAAGCTGA CCCCAGGAGT 
CTTGAGCTGC ACTTCTCTGT AGCTACAATC 
GTACCTCCGG AGAGGAGGGC AAGAGAGGAA 
GCCACCAAAC TGATTATGAA TTGCCCTGM 
AATCGTAAGT TTGTTTTGTT TCATTTTGTT 
AAGATGGCAT TAACTAAAGA TATATATTCA 
AATACTTGCA TAGTATCTTT TACTTATAGG 
GGTGGATAGG TTGGCAGTTC CCCCAAGAAG 
CTGTGAGTTG AACTAATTAG ATCCACAAGT 
GTAGTTAAGA GCACACTCTA GAACCAGATT 
TCTGCCTTTT ATTATCTGTG TACTTTGGGC 
GCTTCATTTT TCTCATCTAG AAAATGGAGA 
GCCTATAATC CCAGCACTTT GGGAGGCCGA 
TGAGAAGTTC AAGACCAGCC TGGCCAACAT 
AAAATACAAA AATTAGCCAG GCATGATGGC 
ACCCAGGAGC CTGAGGCGGG AGAAACACTT 
GTAGTGAGCC AGGATTGCAC CACTGCACTC 
AGACTCAGTC TAAAAAAAAA AAAAAAAAAC 
GTGCAGGGCT TACACTTATA ATATCAGCAC 
GGATTGCTTG MCTCAGGAG TTTCAAGATC 
ACCTCATCCC CACAAAAAAT CAAAAATTTA 
CCTGTGGTCC CAGCTACTCA GGAGGCTGAG 
CAGGAGGTTG AGGCTGCAGT GAACCATGAC 
TGGATGACAG AGCAAGACCC TATCTCAAAA 
GCCAGGCGCG TTTGCTCACG CCAGTAATCC 
GCAGGTGGAT CACTTGAGGT CAGGAGATCG 
GTGAAACCCC ATCTCAACTG AAAATACAAA 
ATGCTCCTGT AGTCCCAGCT ACTCACTTGG 
CTTGAACCCA GGAGGCGGAG GTTGCAGTGG 
CTCCAGCCTG GGAGACAGAG CGAGACTCTG 
TAAAATAAAA TAAAATAAAA TAAAATAAAA 
CAGGCACGGT GGCTCACGCA TGTAATCCCA 
GGGCGGATCA CAAGGTCAGG AGATCGAGAC 
AACCGCGTCT CTACTAAAAA TACACAAAAT 
CACCTGTAGT CCCTGCTACT CAGGAGGCTG 
CCCGGGAGGC GGAGCTTGCA GTGAGCTGAG 
CCTGGGCGAC AGAGCAAGAC TCTGTCTCAA 
GTTGGGCGCG GTGGCTCGCG CCTGTAATCC 
GCGGGCGGAT CACCTGAGGT CAGGAGTTCC 



CACTTAACTC 

ATATAGAAGT 

GTTTTCATAG 

CTCACATTTA 

ATAGAACAAT 

GCACTGAGCA 

AGAGGTGGCC 

CTTCCCCTTC 

TGTAAATGGC 

TTGCCTGCAA 

GAGGAGCTAT 

CCTTCTTTCC 

CAGGCAGGAA 

GAATGAGTTC 

ATCTGAAAAA 

TTCTTAAATT 

ATATAGAGTG 

TGATTTATGA 

TTGGAAATGA 

AATGAAAGCA 

GCTTAGTTTC 

AAGTTACTTG 

GGCCAGGCGT 

GGCGGGCAGA 

GGTGAAACCC 

GGGTGCCTGT 

GAACCTGGAA 

CAGCCTGGGT 

AAACTGGAGA 

TTTGGGAGGC 

AGTCTGGGTA 

GCCAGGCATG 

GCGAGAGGAT 

TGCACCACTA 

AAAAAAAAAA 

CAGCACTTTG 

AGACTAGCCT 

AATTAGCCAG 

AGGCTGAGGC 

GCCAACATCA 

TCTCAATAAA 

TAAAAAAATA 

GCACTTTGGG 

CATCCTGGCT 

TAGCCAGGCA 

AGGCAGGAGA 

ATCGCGCCAC 

AAAAAAAAAA 

CAGCACTTTG 

AGACCAGCCT 
GGCCAACATG
GCACGATGGC 
AGAATAGCTT 
CACTGCCCTC 
GAAATGGAGA 
GTATTACTGT 
GCACTCATGA 
CTGCCAGGGC 
AAGAAGTAGT 
GGCCGAGCGC 
GGTGGGCAGA 
GGTGAAACCC 
AGGCATGATG 
GGAGAATTGC 
GCCACTGCAC 
AAAAAAAAAA 
AGAAAGTGAA 
TGCTGAATCT 
GCTAGTTTGG 
ACTCATTTCC 
CCTAATTCAG 
TACCTGTATA 
TGTACAGAGG 
ACACCTCTTT 
GGCCTAGATT 
CTGGGTGGTC 
GCTTCTTTAA 

actccttcct 

cttgcctaag 
gtgttccctg 
ttgagAtatc 
tgtcaacaca 
gtatcatgga 
gcttatgtgc 
aggcagtaga 
acactagttg 
gactttaggc 
cagcattctc 
tgcttaaata 
ctagcagact
ggcctcctgt 
acctgctcag 
ttgcacatgt 
agcctattag 
cgccaaatcc 

TTGTTATTCT 

TATTTGTTTA 

GTGGAGCCGT 

TAGAATGTAG 

TGAACGACTC



GTGAAACCTT 
AGGCACCTGT 
GAACCTGGGA 
CAGTAGAGTG 
TACAAACTTA 
GAATAAAAGT 
ATGTTTGTTC 
TGAATAATAT 
TAIIIIIMC 
AGTGGCTCAC 
TCATTTGAGG 
TGTCTCTGCT 
GCAGGTCCCT 
TTGAACCCAG 
TTCAGCCTGG 
ACCAAAACCA 
AAGTTAGTGA 
AGATTTGGTC 
CTGACTCACC 
CTTTATTTTA 
CTTCCTGGGA 
GGGGATATGA 
GTTTGATAAA 
GGACATTAGG 
AGGGTCATTC 
CACCAGTCAA 
GCATTGACCT 
TTTACAGAGG
ATCACACGCA 
CCAGACGAGG 
CGAAGCATTT 
GCATTTGTTA 
GCAACAGTGG 
TTACAGCCCA 
GATGTGGCCC 
GCCAGCAAAT 
ATTTATACCC 
TTTGGCATTT 
CCTCTGATAG 
TGGTCTTTAG 
CTMTTGCTG 
CGTTATATGA 
TGTTCCTTCA 
AGTCTGCCAA 
ACCCATACCT 
CTTCGTTATT 

CACATCTAGC 

ATCTAGTTTG 

TGGGTGCTCA 

TTTGGACACT 



GTCTCTACTA 
AATCCCAGCT 
GATGGAGGTT 
AGATTCCGTC
CTACCTACCT 
GTGTGTAGCA 
TTTGTTATTA 
GTGTGAATTG 
AATTAAAACT 
ACCTGTAATC 
TCAGGAGTTC 
AAAAAAAAAA 
GTAATCCCAG 
GAGGTGGAGG 
GTGACAGAGG 
ATATAATAAA 
AGCAAAACTA 
ACCAGAATAG 
ACTGCCAGGA 
AGTCCATGCT 
TACTTAATAA 
GTGTTCTGAT 
TGGTTAGGTC 
AAGGTCAAAA 
ACCAAGAAAA 
CCTTCCTTTG 
GTAATGGGTA 
AAGAAGTTGA 
GATTTTCTGT 
GCTTTTTTCC 
TTCCCAGTGC 
CTCAATGTTA 
ATGATTATCT
TATAGACAAA 
CAGGACAAAG 
TTCACATGGG 
ATTCAGAGAG 
CAGCTTTGCG 
CTCTTCACTG 
TGCTCTGCCC 
CCCATATGTG 
GCATACCATA 
GGCCAGAATG 
TACCATCCCA 
CTCCCCACCA 
CTCTTCATAC 

ATCACTCTTA 

TCTTTGTATC 

GAGTGTTTGC 

TGAATAAAGT 



AAATTACAAA 
ACTTAGGAGA 
GCAGTGTGCT 
TCAAAAAAAA 
CCTTACAACC 
CTGGGAACAC 
GTTACTAGAG 
GTGATTGTCG 
TAGTTTAAAA 
CCAGCACTTT 
GAGACTAGCC 
AAAAAGTACA 
CTACTTGGGA 
TTGTAGTGAG 
GAGACACTGT 
TAAGTGGCCA 
GTACTGTATT 
GGTCCTTTGT 
TGAAATTTCT 
CACAGAGCAA 
CAGGAAGGGT 
TTTAATAGTC 
AGAACCATCA 
ACCTGAAAGG 
CATCAGCCTT 
ATCACACCTC 
TGGAATTTTT 
AGCCCAGAGA 
TAACCAGGGT 
TTGAATTGCC 
AGCCTGGAGA 
GACATTCAAT 
ATAAGGGGTT 
TATCAGCTGT 
GCATACTCTG 
CATATACACG 
CCAAACTGGC 
TTCTGTTAAA 
CCTGTAGGCA 
CTACTCTCTT 
CCATGCACTA 
CTCTTTATGC 
CCTGTTACTG 
TCTTCTGTGG 
ATCAGAGACT 
CTCAGTTATA 

GAGTGTGAAA 

CCAGAGCTTA 

TGGGTGAATG 

CCATCCAGTA 



AATTAGCCAG 
CTAAGGCAGG 
GAGATCGCGC 
AAAAAAAGAA 
TACCCTCACA 
TATTCACAGA 
AGGCAAATGT 
CACATATCTA 
ACCAATATAA 
GGGAGGCCGA 
TGGCCAACAT 
AAAATTAGCC 
GGCCGAGGCA 
CCGAGTTTGT 
CTCAAAAAAA 
GCAATGAAAC 
CAGATAAAGA 
GGCAACCTGG 
TTCAGTGGCT 

CCTTCTGATG 
CTGGAAGTAG 
AATTCATAAG 
CAGAATGTCT 
CCAAAAGCTA 
GAAGAGTTCT 
CTTCCTCGTT 
TGCTCACCTA 
GATTTAATGG 
GATTTTTCAG 
TAGAGATTTC 
AGGATGTCCC 
TTTCTAATTA 
GCAATTCCAT 
TAAAATGACA 
CTGTTAGTGA 
GCCAACTGTA 
AACTAAAGAT 
AATCACTGCT 
ACTCTTTAGC 
CCACCATTCt 
GAGCTTACAG 
CTCAGTGCAT 
CCTGGCAATC 
AGGAGCCCCC 
TCTTCTCTCT 
TCCATTTCAG 

TTCTCCAAGT 

GCAAAGTGCC 

ATGTATTTGT 

TGCACCATTA 
SE S£ ffl SS

CATCAAAACC AMCCTCAA GGAATCGCAT 
rrrirrTGCA ACTGTAGCAA CCTGCTGTGC TTATTTTGCC GTGTTTTTCA 
S S GUXXTTCTC CCATGGGCAG TGCTGGMGT 
GTGCTMCM ATTCTTTCTC CAWTGCTT ACSATTACAA MAAMCCCT 
rARrATPTCA TGCCAGACTT GAGTTAAGGT TGTTTTCTTT TGTGTG TCAfa 
CTGTATTCTG GTCATGACTT CCTGATGATG CCCTATAGAG ATTTTGCTGA 

Stcagaggg tgctccactg CCATCAGTAG cactgactct tgcagaagca 

SgtSSS Aghggctaa tgtcatccct ocgittctt tgtttcaaat 

TTGTTTTAGT TCCAGAGATA GCACTTTCAT G6AATGACGC jATCTTCTAG 
AATCACTTTT TTTTTTTTTT TGAGTTGGAG TCTCGCTCTG TCGCCAGGCT 
GGAGTGf AGT GGCACAATCT CAGCTCACTG CAATCTCCAC CTTCCGGGTT 
CM™ SicA GCCTCCCGAG OAGCTGTTAC TACAGGCGCA 
CACCCCCACT CCTGGCTAAT TnATGTGTT TTAGTAGAGA C66GGTTTCA 
CCGTGTTGGC CAGGATGGTC TCGATCTCCT GACTTTGTGA JCTGCCTCCT 
TrAGfTTfCC AAAGTGCTGG GATTACAGGT GTGAGTCACC GCGCCTGGCL 
TAGAATCACC TTTTTATACC ATAACGTGAG CACCACTGCC GCGTCACCAA 
WWUTOG AGGCaScTAC TGTGGGGTTA CAAATGGGTA AOAGIGGCAC 
CAGGAAGGTG AAAGTCTCTA CTTAGCCAAG ^MCAAA ATGTCAATCA 
CCAAACATTT ATTTATTAAG CTAC^CAG —A TGAACAA^ 

cffioc tcESstt ATTGTGAGAA iaaaat&aaa T^AGT^A 

AftAr ArTTAn RAAAAAGAAA AGCATTGGTT TTCAATTGTT AGTGTGGATC 

SwtIgg&ctow aaaatgcaga rara&ccc cagtctogc 

G4TTCTEATT CTGTATATCT GAAGTOGSftC TCAGGAATCT TGATTTTCAA 

cSSc Stf ATGCTGCTAT TCCTTTAGTT ACACTTTCAG 

AAATATTACT 6TAAATCAAA TGGCAAGAAT AAAATAGTTA TTTGAGGCAG 
SflG TOeElHi AGTCCAM6A mGGGTCAA ACTC«OTT 
TrTCtPTTCC. TAGACCTGTG ACCTTAAACA GCAACCTTCT CTGTGAACtl 

tSttocctc AGGAACGGCT CTGGTCACCT CCTGCTGTAC TCCATTGATG 

™£c a£ tgggagtccc ccaaaccttt GCTCTCTTAA 

CTCCTTTTAC AGCCTCCTAC ATCTCCTGCA GGTGCTGTCT TCTCCTCCTT 
TTTrrArrrr fTGPTCTGAC ACAGCATTCA TTCTCCTCTG GGAAbbbi iu 
rTTCAATGTG TCTCCAAGCA CATCACACCC AGGAAGGACC CTGTGGCCAT 
SSSK SSa AACTACGTGA AGGCAGGCAC TAGGTACTGT 
ttGTGCCCAG CATAGGCCTG 6CCCATACCA GGTGTCCACA GATCCCTAGT 
AAAGAAACCT ATGATTCAGG ACCCCCATGA TGAGCAACTA JAGCACTAGA 
ATArTPATAA TAAfTAATGT TTATAATGCA TCTTCAGTTT ACAGAGGGCT 
mSct TCATCTAGTT TAGTTCCTGC AACAACCTCT TGAGGAATAT 

TTTATGCTGC TGGGAAGGGC AGCACTGAAA TTAAAAGAAA AGTTTTCTGA 
S" SnWCTTT CCTCAATGTG A^TCTAGCA AGGTATTCAG 
GAATCCTGCC TCTACAGTTC AGAGCCTCAA ATTGCTGGGT ATGnGAGTT 

ga sffssss sskk 

£S* CC™ ACATGAAGTC AAGTCTAACT AGTCACTACT 
ATTTrAfTAr TGCTGACTCC TGATGATCAG CTCCTTTTCT AAGTGCTTAC 
Sera TTOOCHC TGCCTAGAAT TOUGTCAAG OWCAAAK 
AAAAGGATCA TAA6GCTTCC TTTTTCCAGT ATGTTTTTCC TCuililGA 
20001 AAACTGGGCC

20051 GCGCCTGGTA 

20101 CCATTCTCCA 

20151 AACATATTTT 

20201 AATCTGAACA 
20251 CAATTCTCCA
20301 GACTAAATCT 
20351 TGATACCTAA 
20401 CATGACATCA 
20451 ACACTGACTG
20501 TGTTCTCATT 
20551 TTTTATTTAT 
20601 GTGTGCAATG 
20651 AAGCGATTCT 
20701 CACCACCACA 
20751 CATGTTGGCC 
20801 CTCAGCCTCC 
20851 CGGGACCCTT 
20901 TTGGAAGAGG
20951 GTAATGCTTA 
21001 CTTCCTTGAT 
21051 TTCTTGGGTT 
21101 GGGTGATTCA 
21151 CCGGTTATGA 
21201 CTGTGCCTCA 
21251 TTTCCAACTT 
21301 TCCCTACTAT 
21351 CATAACTGAT 
21401 TTTCTATCCA 
21451 AGTGTGACAT 
21501 GGCTCTGAGG 
21551 TCTGGTGATC 
21601 GTCAAATGGA 
21651 GGAGAAGGCT 
21701 TGTCCTCTCA 
21751 TAGGGGGAGA 
21801 GACTCACTGA 
21851 AGAGGAGTTA 
21901 ATATTCTTCC 
21951 TTAAAAGATA 
22001 CTGATTGAAT 
22051 CTGTGTGGCT 
22101 TTCAGTATCC 
22151 TGTGGCCCTT 
22201 GTTTAAACTC 
22251 CATAGGGGTG 
22301 GGCCCAGACA 
22351 TAGTTGTAAC 
22401 TTAATGTGTC 
22451 CATTCAAACC 



AGTTAGCTAT 

TATAGTAGAT 

GTTTCTCCAA 
CTTTTTTCAA
CATTTCATGT 
TTCCTAGTTT 
CTAAAGTTCT 
GATAGATGCC 
CATTACAGGA 
AGTTCCATGA 
TAATGCTCAT 
TTATTTATTG 
GCATGATCTT 
CTTGCCTCAG 
TCCAGCTAAT 
AGGTTGGTCA 
CAAAGTGCTG 
GTTTTAGAAG 
GGAGGAGTGG 
CCTTTCAGTA 
TGGAGTCCTC 
TTGAATTGTT 
TTCACTTACC 
AGGCTATTGT 
GCTGGTTCTA
ATGAAATGTG 

AATTGCCAGT 
GAATGTTCTG 
GGACCAGTTT 
TtCATCTCCC 
AACCACACAC 
AATCCTTCAA 
TTCCACCTGG 
GTTCCTCTTC 
CTTATTCCCC 
GGAGAGCGCC 
GCCCGTCGCC 
TGCTCATAGG 
ATTAGTACTG 
ATTTGCTAGT 
TACAGGGGAA 
GGGAGAAAGA 
CGTATGTTAT 
GTGTGTCCCC 
AACTTGAGAC 
AAAAATGTTG 
GAGAGAGTGA 
AATAGTCTTA 
CCAGACTTTG 
CAGACAGTCT 



CTCCATTTTT 

ATGGAACATT 

AGTTACTAAC 
TATATTGGGA 
GACTTGGTAT 
CAAGTTCATG 

ATCCAGATGC 
AAATATTGTC 
GTAGCAGATA 
GCCAGATACT 
AACCCTGTGA 
AGACGGAGTC 
GGCTCACCGC 
CCTCCGCAGT 
TTTGTATTTT 
CGAACACTTG 
GGATTACAGG 
GATGACTGCT 
GGCACGAAAG 
TTTGGAGGCT 
CCAGCCAATA 
TGACCAGAGC 
ACACCTTGCC 
TCTCCAGCCT 
AGGAGTCAGT 
CTGGAGATTA 
CAAAGGATTC 
CCAGCTGCTC 
CCAAGGGTGG 
AGTGATGGGT 
TTGGGTCTGA 
AGGTTCCTCC 
GAGGGGCTTC 
CAGGGGGAGG 
ACCCACCCAC 
TGCAGCCTCC 
GCTGGAACAG 
CTCCCTGGCC 
TGTTCATCAC 
CCCAGACTTA 
CATAATAGAT 
TTGCTCCCAG 
TTCCCCACn 
TCGGCTAGGA 
CCAAGGAAAA 
ATGCTGGGAG 
CTTGCTAAAG 
ATGATATTAA 
TGCCAAGGGC 
GGCTCTGGGC 



ATTTCATGAA 

ACACTTTGGA 

AATGGTTCCA 

AATAATTCTC 

CCTCATATGT 
AACTGTAAAA 

CAAATTCTTT 
TTTTACCTGG 
CTAAACTCTC 
GAAGTGAGCT 
AGCTGGGAAT 
TGGCTCTGTC 
AACCTCCGCC 
AGCTGGGATT 
TAGCAGAGAT 
ACCTCAAGTG 
CATGAGCCAC 
GCTATAATGT 
ATGGTTAGTA 
TCGGAGTCCT 
GAGGGCTTCA 
TTTCTTCCGA 
TGAACATTCA 
GTCACAGACG 
TTGTTCAGCT 
ACACCTCTCC 
CTGCAGTTGC 
TGAGGACCTA 
GAGGGTGAAA 
GGCTTGGGCC 
GCAGCCAGCA 
TGAAGTCTGA 
TGCTTCAACT 
CAGTTTTCAT 
CAAGTCCTTT 
TGCTCACATT 
CAGAGCTGTG 
TCAGTCTCTT 
ATGGAAATCA 
ATTTGGGGCC 
TTTTGGTGAG 
CTCTCCAGCT 
CCAGCCCACC 
TCCTGACCTC 
TAGAGAGCCC 
CTATTTAGAG 
GCCACATAGC 
TGGCTAACAT 
TTACATGCAG 
CCAGGCTGAG 



TACATCCCCA 

GATATTGCAC 

TCACTGTGCC 

CCAGTCTGAA 

CTTGGGCTTC 
CAAAGGATTA 
TCTCTTTCCA 
TGTTTGTGAA 
ACTCTGTAAA 
TGTTCACATA 
TGCTGGGACA 
ACCTAGGCTG 
TCCCGGGTTC 
ACGGGGCACA 
GGAGTTTCTC 
ATCTGCCTGC 
CATGCCTGCC 
AGAAAGTGAT 
GATGGGGGTG 
CAAAAATTCT 
CACAAACAGT 
CAAAAGGTTG 
CTTGGGGCTG 
CTTTGAAGAC 
CCGTGCCAGG 
TGCGATTTTA 
CTCTGGCAGC 
GAAGAGCAGT 
TATATCCTCC 
CTTTGAAGTT 
GCTTATCACA
ATTTTTGGAG
CAGGACATGG 
GGCATTGAGA 
GTAAGAGGAG 
CCTAGACACC 
TGAAATGTCA 
TGTGGCTTGC 
GAGGGTACAA 
CCCTTCTTGC 
AAATAGTTGT 
GGGCAGCCCT
TCACCTCCTC 
CTGCTCAAGA 
TCTGCAACCT 
ACCTAACCAA 
TAGCCCACAG 
TTATCAACCT 
TGCATTGTCG 
CTTTGGTAtA 
GCATGGTAGA
TCACTTCTCA 
CCTGTACCTC 
AATAGTGGGG 
TAATACAGGG 
GGGCTAGAGT
TCTGCACAAA 
GTGGTTGAGC 
TGCCTATTTT 
GACAGAGTTT 
AGCTCACCGC 

CCTCTCGAGT 
TTTGTATTTT 
CGAACTTCCA 
GGATTACAGG 
GTAGAGATAG 
TTCAAGCAGT 
GAGCCACTAT 
AAAAAAGCAA 
ATATCAGTGT 
AATAAAAAAA 
TCAGCACTTT 
AAGACCAGCC 
AAATTAGCCG 
GCTGAGACAA 
CAAGATCGCG 
ACACGCACGC 
TGGTGGCCAG 
GATCACTTGA 
CTGCACTTTA 
AAAAAAAAGA 
CATGTCCCTT 
CACAATTGAG 
TTGCTCTCTG 
CATATGTACC 
CATCTGTAGT 
TCCAGAAGGT 
CTCCAGCCTG 
AAAAAAAAAA 
ACATAACCCC 
GTTTCCTCCT 
CACTTAACAC 
CTTAACAGTA 
GTGTCCAGTT 
TCTTTATCAG 
CGGTGACTTC 
AGAGCTGATG 
TTTCCTCCAG 
CAAGGGCTTT 
GGCTGTTCAG 



ACGTTGTCTA 
CATTTACAGC 
AGTTGCTTTA 
GTTAAAATTC 
TGAGCACCTG 
GTGGTGTCTT 
CACCAAGAGC 
TCTGTGGTTC 
AATACGGCCT 
CACTCTTGTT 
MCCTCTGCC 
AGCTGGGATT 
TAGTAGAGAC 
ACCTCAGGTG 
CGTGAGCCAC 
GGTCTTGGTT 
CCTCCCTCCT 
GCCTGGCCTA 
AAGAATGCTT 
CCCAGCCTGG 
AATAAGCCAG 
GGGAGGCCGA 
TGACCAATAT 
AGCATGGTGG 
GAGAATTGCT 
ACACTACACT 
ACGCACACAC 
CACGTGTGGT 
GCTTAGGTGG 
GCCAGGGCAA 
AAAAAATCTT 
AGTTTATGTT 
TGGCCACGAC 
GCCCTTTACA 
AGGTTTGAM 

CCCAGCTACT 
CGAGGTCAAG 
AGTGACAGAG 
CACCCTCACC 
TCAGAACCTA 
TTTACTGGCA 
AGGGCCTAGA 
TTCAAACCCA 
GGTGGAATGG 
ACTTTCCTGC 
TGGCTCTTTA 
TCACTGCCCC 
CAGCCTTGCT 
CTACATGGTA 
GTGGGCTCCC 



TAATGTCTAG 
TGAGTGACCT 
TCTGTAAAGA 
ATTCATACAA 
TTCAGTGCTT 
CGTGGTATAG 
TGTTCTTCAC 
TAGAACAGAG 
GTGATTGATT 
GCCCAGGCTG 
TCCTGGGTTC 
ACAGGCATGT 
AGGGTTTCTC 
ATCTGCCCGC 
CATGACTGGC 
TGTTACCCAG 
TGGCCTCTCG 
TATGACCTGT 
TGTGACATGT 
GCAACAAAGT 
GGCCGGGCGC 
GGCAAGTGGA 
GGTGAAACCC 
CATGCGCCTG 
TGAACCTGGG 
GCAGCCTGGG 
ACACACACAC 
CCCAGGATGC 
TTGAGACTAC 
CAGTGTGAGA 
TCCATAAGTA 
TTATATATGG 
AGTCTGTATG 
GAAAAAGTGC 
CTCAGCCTCA 
CTGGAGGCTG 
ATTGTAGTGA 
AGAGACCCTG 
ACTTATCAGC 
TTTCCTAATC 
ATTTAAACAT 
AGATATTAAC 
TGTGCTCTTA 
GAAAAGGCTC 
CATGGTTCAC 
CAATGGTGAG 

AAATCCAGTA 
TTTTCCTTTA 

GGCTCTGGTT 
ATTCCAGATA 



TCTGGGTTCA 

CAGGCAAGTG 

GAAAAATCAC 

GTAGTGCTGC 

CCTTCTTCTG 

ATAGATAGAT 

TATTAGAGGT 

GCCGGCAAGC 

GATTTTTTTT

GAATGCAATG 

AAGCGATTCT 

GCCACCACGC 

CATGTTGGTC 

CTCAGCCTTC 

CTGATTGACT 

GCTGGTCTCA 

AATGCTGGGA 

GATTTTTAAT 

GGAAATTACA 

GAGACCCTGT 

AGTGGCTCAC 

TCACCTGAGG 

TGTCTGTACT 

TAGTCCCAGC 

AGGCGGAGGT 

CAACAGAGCG 

ACACACACAC 

ACTGGAGGCT 

AATGAACCAT 

CTGAATCTCA 

AATATCTGTT 

CTGCTTTTGC 

GCCTGCAGAG 

CTTGACCTGT 

CAGCTGGGTG 

AGGTGAGAGG 

GCCATGATGG 

ACTCAAAAAA 

TATTTGTCTT 

TGTTAAATGA 

GATGGATAAT 

TGCTCAATAA 

TCACATGCAT 

CCTTGTAACC 

AGTAAGAGAT 

CGGTGTGTGC 

GTGAGATCTG 

CAATCCTGCA 

TGGTCATCGT 

CCTAGGCTTA 
AATCCTGGCT
ATTTAACCTC 
AGCACTGTGG 
AAGCAATGTT 
GCTGCCTCTG 
ATGGCTGAGC 
AGTAAACAGA 
TATGGCCCAT 
TTCTTTTTGA 
GCACGAACTC 
CCTGTCTCAG 
CTGGCTAATT 
AGGCTAGTCT 
CAAAGTGCTG
GATTTTTTTA
AACTTCTGGC 
TTATAGGCAT
GGTTAGGGGA 
TGAAACTCAA 
CTCTACAAAA 
ACCTATAATC 
TCAGGAGTTC 
AAAAACACAA 
TACTTGGGAG 
TGCAGTGAGC 
AGACTCCGAC 
ACGCTGGGTA 
TAGGTAGGAG 
GTTTATACCA 
AAAGAAAAAA 
GGAACATAGC 
CCTATAATGA 
CCTAAGATAT 
GCTCTAGAGC 
TGATGGCACG 
ATCACTTGAG 
CATCACCGCA 
AAAAAAACAA 
GAGAATAGTG 
GGCTGATGAC 
AAATGCTAAG 
ATGGTAGCTT 
TGTTGTCCCT 
CCATCTACCA 
AGAAGCTGCA 
CTGGTAAGGG 
AGTGTTCTGG 
GGCAGGGAGA
CACAACTGGG 
TCAATCCCTT 
25001 TTGGCACCCC
25051 AGCATGGTTA 
25101 CAATGGATGG 
25151 GAGTAACCTG 
25201 CTCCTTAAAG 
25251 GAGTCACATG 
25301 TTGACCTTGA 
25351 GTAGAGTGGG 
25401 GTTTGCCACA 
25451 TTCCTTATTT 
25501 TTGGTCAGCT 
25551 TGGGAGCCCT 
25601 GCTCTCTGGA 
25651 CTTCCTGGAC 
25701 GAATTACTGT 
25751 ATATGTTCAC 
25801 GCTCAATTGG 
25851 CAGCCCAGCT 
25901 CAACCACACA 
25951 CCTCATTCAG 
26001 TCTTTCTTAG 
26051 AGCCTCCACA 
26101 TTCTGAGAAC 
26151 TGCTGGGGGC 
26201 AGGGAGAGGT 
26251 CCCCGGGCCT 
26301 ACGGATTTGC 
26351 TTACTTGTCA 

26401 TTGGTCTCCG

26451 GAGGGTGGTC 
26501 CTGCCGGGCA 
26551 GAGGTACCCC 
26601 AAGGAACTTC 
26651 AAGGAGCCTC 
26701 TTCCGGGGGG 
26751 TTACCTGCGG 
26801 TGGGCAGGTT 
26851 GCCTGGTACA 
26901 CACATTTGCC 
26951 TTGCTGGGGA 
27001 TTAGAGTGTT 
27051 AGGCCTAAGT 
27101 ATGTGGAAAC 
27151 CACCTCCCAT 
27201 TGAATGGAGC 
27251 GCAGCCTGTT 
27301 CGACTTTTCC 
27351 CTGCCGGGAT 
27401 CCATGTGGTT 
27451 CTGGACAGGG 



AGGCCTTTTT 
TCACAGGACA 
TGTTCTGCAT 
GGCTCCATCC 
AGCCAGAATG 
AGCCTACATT 
GTGAGTTACC 
AATAATTCCT 
TGGAGACACA 
GCAACACAGC 
CTTGAGGCTG 
AGTGCCAGGA 
CCTGTCTCTC 
CTGGCATCCT 
CCTGTAGGCA 
ACTCTATCCT 
TCCCAGTTAT 
CCACCCCTTG 
CCTCGGTTCT 
GGGTGGGACT 
CTAGTCACCG 
GAGAGGTCGT 
TAGGGAGGAG 
CAAGTGGCCT 
GAGCTCAGGG 
GTCATGGTGG 
AGCTGAGCCT 
GTCCCGGCTT 

AGTTCCCCTC 
TCCCTAAATC 

GAAGCCAGCG 
CATTATGTCC 
TTAGGGAGCT 
CAAAAGATGC 
AATAGCCCAA 
GCCATGCTGC 
TTGCCAAACT 
CAGTAGGCAC 
TCTGGCCTTG 
AAGACCTGGG 
CCTGAGCTGC 
ACCTCCACGA 
TCTACCTCTA 
AGAAACTCCC 
CATTCCAGGC 
GTTCCAAAAA 
CTTCTGGCTT 
ACTAGTCAGG 
TTGTGGAATG 
GGAAGGGGGA 



CTCCCTCATG 
AGTAGAAGAA 
GTGAACACTC 
TATTTGCAGA 
AAGCCTGGTA 
TAAATTCCAG 
TAATCTCTCT 
GTCTCAGAGA 
TCAGGTGTAG 
CCTGCCCTGG 
TCCCCAGGAC 
CAGAACAGAT 
CTACCAGAGG 
CTGCTTTTTT 
GCTCCTCTGC 
GCCTTGCCCT 
TGTCTGCAGC 
CCTGCAAGGT 
GCGGGAGCCC 
GAAGAAGAAG 
GCCCCTGCTC 
TTTCTCGGAG 
CCATCCCAGC 
GGAGTCCTCA 
CAGCCTGCCT 
CCATCTACAG 
GTCTATCTGG 

actfcacctc 

TCCATCTCTC 
TCCTTCTCAC 
GAGGTTATAC 
TGGAAGTGGT 
CCAGCTCCCC 
CACTGACCTG 
ATAGAGTGCT 
CTGCCCAGGA 
GTGGAAACTG 
CTTATAAACG 
AAGGGCTTCT 
CGAGTGCTTC 
TGGGCCAGCC 
GCCTCTCTCT 
ACCTGGCTTT 
CAGGGGGTTT 
TAGGGTGGGG 
GGCTGCCTCC 
CTCTAAGCTA 
TGGCCAGGCC 
ACCGGACCCT 
AGGGMCTGG 



CCCCATTTTT 
GCTCCACTGT 
AGTGAATAGT 
GAGCTTTGGA 
GTGGGAGAGC 
CCCTGCCACT 
GTACCTCACT 
AATAAAAGAG 
GTTAATACTC 
AGTGGAAGTG 
AGGCAGAGGG 
GGCAGCTCAG 
TCCCCCCGTC 
TTTTTTTCCA
TTGAGGACAT 
TCCCTGAGCT 
GCCTGCCTGC 
CTGTTTCCTA 
CTCCTCHCC 
GCTAACTTGA 
AAGAATGCCA 
TCCAGAGGGG 
CATGAGCCCC 
GGCTCCCGCA 
GCAGCCAGAG 
CCGGCCTGAG 

TGTGGGAAGA 
CAGAGACCTG 

CTGGCCCCTG 

TTAGTCCTTT

CCAAGGAGAA 
GAGGGGAGGG 

TTCTATCCCA 
CCCATTGTAG 
GTTTCCAGCT 
ATTTGTCCCA 
GCAAGTCCTG 
TTTGTTCTCT 
GAGCTCCCAG 
TAAGACTGGA 
CCCACACCTC 
GTGGGGCTTC 
CTTTGCTCAT 
CTGGCCCTCT 
HIGH MCA 
CCCTCACCAG 
GGTCCAGTGC 
CTGGGCAGAA 
GGTAGATTGC 
TCCTCAATGC 



CAGTTTGAAA 
CCACTGAGGC 
GAGTGAATGA 
AAAGATTTTT 
TCCAGCTCTA 
GACTCCCTTT 
TTTCTTGTCT 
TGCATATAGT 
TGGGCCTTGT 
GCACCTCCCA 
AGGGAATGAA 
AGCTAGGATG 
TGGTGTGGCT 
CCTCCAAGCA 
CTGGGGCCAG 
CAGGATGGAC 
AGCCTCGATC 
ACAGCTGCTC 
TCCCTCCCTC 
CAGCAGCGCT 
GTGTGTGTGT 
CCGCCTGAGC 
TGTGGGAATC 
GCTGCTCCGG 
GTGCGGGGAG 
GCAGTCACAG 

AGATGGGGAG 
TTTCGGTGAG 

GTCCTGAGAG 
ACCATCGGTT 
TCGGCCTTGT 
ATATACCCAG 

GACAAACCTG 
ATGTTACTGC 
CTCACATGTC 
ACAAGCAGGA 
GGTGTGGGTA 
TAATGGCAGG 
GTGAATGTAG 
GCAATGGGCT 
CTCAGTCCCT 
TCAGAGGGAG 
TGCCCCACTC 
GGGTCCCTTC 
TTCTTTGGGA 
TGGTCCTGGT 
CCAGATCTTG 
AAGCAGTGTA 
TGGGAAGTGT 
TGACTCTACC 
27501 AAGCGCCCTG CTAGACACH TATCCTTTAA TCTCTCAACA GCCTAAAGAG

27551 ATTATATATC CCCATTTTAC AGATGAGGCA ACCAGTTTCA ACAGAGTTAA

27601 CATATGGAGC CTCACTGGGC AGCTTTTTCT GTCTTCCTGA CTTTCTCTCA 
27651 TCCTTCAGGG GGCTGCAGGT TTGTTTTCTT CTCCTAGTGG AGAGGAAATT 
27701 CTCAGGTTTG TTTTCCTCTC CTAGCAGAGA GTAAAAAAAG GGATAGTTTG 
27751 CCTGACTTGT TGAAGGTGTG GCTGAGATTG TTTTCTAAAG AGCCAATGGA 
27801 AATTGATCTT GAGTTTAGGA GAAAGCTTTT ACATGTGGAA TTAAGATGCC 
27851 AAGTGTTGAA GTAGCCACAT TTCAGGTCCT CATTAATTTC TCTTAATCCT 
27901 GGGAAGGCAG CTTAGGAGAA GGGTTGTTCC TTTAGGAGCC AGGAACTATA 
27951 CCCCTTTTAC CCTTGGAGAG GCAGGGAAGC CAGGGAGGAC ACAACTTCTC 
28001 AGGAAGAGGA GAAGCTAGAG CAGATAGTGA ACTCTCAACC TGAACCTTTA 
28051 AGGGCCAGAC CACTAATGCC ACCCAAGTCC ACCTGCCGTT TGTCTTGTTC 
28101 TGTCCCAGGC TTTCTGGAGA ACCTGATCTT CTTGCCCCTA CCCCCAAGCT 
28151 CCGTTTGCCC AGCTAGAGTC TGGGGGGTAC TGACTGACTT TCGTAGACAT 
28201 TCTTCCCnC CCCAAATAAG AGGCCACATT CCTGAAGTCA CTTCTGAAGA 
28251 GATAGCTGCC ACACAGGGCT CTTTCCCCCC AGGGAGGGAC CACCCAGACC 
28301 CTCTGCTCTC CCAGGTATCC GTTACCACAT CACTACCTGG TCAGAAAGCT 
28351 GTTTCTGCCA TTAGCCCCTC CCTCTTTTAT TATAGGATAT CCTCAAGGGC 
28401 TCCTCTTTGG GCCTCAGTTT CATCCTTGGC AGAAAGTAGA AGCTAGACTT 
28451 CTTGGGCTCC TGAACAGGGT CCTTGCTGGA TTCTGTGAAA CAAATTAAGT 
28501 TCTTGACCCT AGGCCTCTGG GGGAGTACAA AGTtTATGGG AGTTCTGGGG 
28551 CTGTGGTTGC AAGGAAAGTG ACGCAACCAG ATTCCATGGG GACATGATCA 
28601 GGCGTGACAT GTGAGGGAGG AAGAGGGAGC AAGGGAATGA AGAATACAAC 
28651 TTCTGTGTCC CATACACCCC TGCCTGACAG GCCATACATA CTCAGCAGAG
28701 AATGCACTGT CTTTCCTACC ACACTAGCGT GAGGAGTGAG CTGCAATTAC 
28751 CACTGTGCTT CCAAGTAAGA AAATACCTCA AATTGGAATT TACAAAAGAG
28801 GTAAATTAGG GAGTGGCTTT TGTCGGACAT CTTTAAAGCA TTTTTCTTTT 
28851 TATAGAATTT CACTTAATGT CCAATACTGA TTTAATGAGC TTGGGTTTAC 
28901 ACATTATCTC TTGAAGAAAA CAAATGAACC TTTGTGTTCC AAAGCAATCC 
28951 ATGTTTAAAG GGAAAAAATT ATGCATAACT CTGCCCAGCT TCACAGTAAC 
29001 CTTTGGCAGG TGCCTTAGGT CCTCTGGGAC TCTTTTCCTT ATCTGAAAAA 
29051 TGAAGGACTT GGATCAGGTG AATGGTTCCC AGCTCTGCAA CTTATGTGGC 
29101 TCCTCAGAGG CACACAAGCT CTTTTCCATT ATTTGCCAAA TAATGGAGGC 
29151 CCTGTCTTTA ACTGCAGTAC AACTACACAA AATACTTGAA ACTACAGTCT 
29201 TCCTGGTTTT TGGTTGGAAC TGAATCAGTG CACTCTAGCA ACACTTATTT 
29251 CTTGCTGTTC GTAGGCTTCA TTATGTGTTT GGTTMTTTT TTAAAACAAC 
29301 AATAACATAT TCCATAATAA TTACAGCTTA ATTGGCAGAC TGTTTCAGTC
29351 TATAGGATCT GCAGGAAGGA GGAGTAATAA AGGGATTTTT GACTGAGCTC 
29401 TTATGGAACA GAGTCTCTCT AGGCCCCTGT CATATCTGCC CTTCTGGGCC 
29451 CTGGGGAAAA GTTGGCATCC CCAGTTGTGG TGCTCTCCAG GTGCGCTCAG 
29501 GCTGTGGTGG AGGGAGCTTC CCATTCTCTC CTTCAGCCCA CTCAATTCAG 
29551 AGGCTAGGGG CTGAAAGAAG CTTCTCTACA ACTGGCTGTT CACTGGGAGG 
29601 TTAAGGGATG ACCATCCAGC CAGGCCTTCC TCAGGACATG GGAGGGCTTA 
29651 TGCTTTAACA TGTGTAAATC CACTGCAATA ATGACTGGTT CTTTTACCCC 
29701 ATAAGGTTGA GAATTTACCT GTAAACATTT TTGTCTGAAG AATTTGGATG
29751 TAAGTGAGGG CTGGGCCTCT ATCTTATCTC ACTTGGCTTC TCTCAGCACA 
29801 GCACCTTGCC TGCTTGTTCT TACACATCCT AGATGCACAG TAACTATTTC 
29851 CTAATTATTA GAAATCTATT AGAATCAATT GATTTCAGCT GGGCTTGGTG
29901 GCTCCTTCCT GTAATCCCAG CACTTTGGGA GGCTAAGGCT GGAGGATCAC
29951 CTGAGTCCAG GAGTTTAAGA CCAGCCTGGG CAACATAGGG AGACCCTGTC 
TCTACAAAAA
CCAGCTACTC 
AGACTACAGT 
AGTAAGACTC 
GGATAGATAA 
CTGTTTCATT 
AAAAGGGGTC 
AAGAAATGAA 
ACATCTGTAC 

TGTACAAGTG 
GTGACACAGA 

AGAGAAAATC 
CTGACTCAAA 
ATCAGGGTTC 
TAATGTTTTG 
CGTATGTAGG 
TTCACTCTGG 
TACCACCTCC 
ACTTTCACTT 
TTTGGGAAGA 
ATGTTCGAGT 
TAAAAGAACA 
AGGAGAATTG 
CACGAAGTGT 
CTTATTTTCA 
CTCACTCCGG 
ATAGCTTCAC 
ATGATTTGAC 
CCTGCTACCG 
CATATAGGAC 
TGATAGCTAA 
TCTGAGAACC 
CCCAGCCCTC 
GGCAGTGACT 
GACACAGATA 
CAGCAGGCTG 
TGCTCCCTCT 
GATATTGGCA 
GCGCACATAC 
GCAAATAAGT 
ATGGAACCCT 
CCATTTTCAG 
TGCTTGGTGC 
AGGGAGCTAA 
MGTCAGAAG 
CTCTGTTGGG 
AGGCCAGGCA 
AGGCGGGCAG 
TGGGGAAACC 
GCATGCGCCA 



ATAAAAAATT 
AGGAGGCTGA 
GAGCAATGAT 
TGTCTCTTAA 
ATAATTCATT 
CTGGGCTGAG 
TGTCTGCATT 
CACTTACTGG 
ATAATTTAAT 
AGGAAACTGA 
TGGTAAATGG 
TCAACTGATT 
AGGTCTCTGT 
CATGCACTTT 
GAAGATGAGT 
CCTCTAGGTG 
CTCCTCAGGA 
CAGCTGGGTA 
GAAACTTCCA 
ACGTATGATG 
AGCACCTTAC 
GAGACTGCTG 
TGCTATTTAA 
TAACACTCAA 
CAGGTGAGM 
TGACATAGTA 
ATTCCAGCTC 
TTTAACTCTG 
TAGGGTTGTC 
CTGGATTATG 
GCTAGAACTC 
CAGTTGTGTT 
CTGTCAGCTG 
TTGGCCACAT 
TGTGGACTGG 
GCCTGGCTGT 
CCTGGGCCTG 
ATGGAAAGGA 
CATTGCAAGG 
CTGCACACAG 
TGtGCTCCCC 
TTTGGAAATA 
TGAGTCTACC 
CAGTCTAGTG 
GCAGGAGGCA 
CACCCCATAG 
TAGTGGCTCA 
ATCACTTGAG 
CCGTCTCTAC 
GTAATCCCAG 



AGCCAGGCAT 
GGCAGGAGGA 
TGTGCCACTG 
AAAAAAAAAA 
TTAGGACCTT 
AAGCAGGTCC
TGCCCTTGGT 
CTACCTTCTG 
TCTCATAACC 
GGCTCAGAGT 
CAGAGAAGGA 
ATCTTTTTTA 
GTGGATCTGG 
GTATCTGCCC 
TTTGGAGGTT 
ATCTCCCCTA 
CAGTGGGATG 
CTCTTCTACC 
AAAATTGAAA 
TCCATGGCCT 
AGTTCCAAAG 
GGGAATTGAA 
CATTCAGTAC 
AGGGTCTTGA 
TCCTGAGGCT 
CAGTGGATGT 
CATCAATTAT
CTTTTCAGTC 
AGGATTAGAG 
GCTGGCATTC 
TGAAGTCTAC 
CTGTGGCAAA 
TTCACCTTCC 
AGCTGGCTGT 
TGACGTTGCT 
CTCCTGCATG 
GCCAGAGCTA 
GGGTGTGTTC 
GCGTAACAGA 
AAGAAAAGAA 
TACCTGGGCT 
TTTGTTAAGG 
AAGAGTAAGT 
AAGAAGAAAG 
AGAAGGAAGC 
TTCTTCAGAA 
CACCTGTAAT 
GTCGGGAGTT 
TAAAAATAGA 
CTACTCAGGA 



GGTGGTGTGC 
TCTCTTGAGC 
CACTCCAGCC 
AAAAAAGTTG 
TCTTTTTCAC 
ATATTGCtAG 
GGTCTCAAAT 
TGAGCCAGGC 
CCATAAGATA 
CATGAAGTM 
ATATGGATCC 
AAAAACTCAT 
GTTGACCCAC 
AAGCCCTCAG 
GTCCTTAGGC 
ACCTGAGGAT 
ACTGGTTCAG 
TACAGCCAGG 
GGTAGAAAAA 
CTAAGCATCT 
TGTGTTCTGG 
CACTGTGAAG 
TTGGGCTAAA 
GCTGTCAGGG 
CAGCTGTTGA 
GGCTTTGCAG 
GTATTGGGCA 
TTCTGTAAAA 
ATAATATAAA 
AATAAATAGT 
CATGGCAACT 
ACACAGCTTA 
AGTTCTTCAG 
GCCCTTTAAA 
CTCCAGCCAG 
CCTGTACTTG 
CTTGCAGCAA 
TGGTGCTCCC 
GCCCAGGCCT 
GGACCTGGTG 
ACTGGTTCTT 
CTTTGCTCTT 
GGGATGCTGT 
ATGGTTGCCC 
CCCTGCTCCT 
CCACATTTAA 
CGCAGCACTT 
CGAGACCAGC 
AAAATTAGCC 
GGCTGAGGTG 



ACCTGTAGTC 
CTGGGAGGTC 
TGGGTGACAG 
ATTTCTATTT 
TTACAGAAAT 
GCATAGGAGA 

TGGGGAGGGA 
ATCATGCAAG 
TTATTAGCAA 
CTGGCCTTGG 
AGGTCTTGAA 
ATGTTCTCTG 
TGAACTGACC 
AACCCCTCAG 
ATAGCCTCAG 
TTCAGCTCAA 
ACCTCAGCTT 
GCAGATTTTG 
CAGCCTTGGC 
GAGGTGGGAC 
GTTCTTTGTT 
TATATGAAGG 
GGAGAAGCAT 
CTCCAGCHC 
GATGTGCTGT 
CCAAGCACAC 
GCTTTGCAGA 
CAGGGATAAT 
TAAGGTACCT 
AGCTGTTAAT 
TCTTAAGTGG 
GGGATCCATA 
AGACATGTGT 
GGCATTCCTT 
GTGTTCTTCC 
TTTGTCTCCC 
ACAAAAGCAG 
ATGCCCTGCG 
GCATTTGGGT 
ACCAGGAGCC 
GCCACTCCTA 
GCAGGTCCTT 
TTTTGTCCTC 
AGGAACTTCT 
ACTGCCAGCC 
TCCTCACTGC 
CGGGAGGCCA 
CTCACCAACA 
GGGTGTGGTG 
GGAAAATCAC 
32501 TTGAACTCGG GAAGCAGAGG TTGCAGTGAG CCGAGATTGT GCCACTGCAC
32551 TCCAGCCTGG GCGATAAGAG CAAAATTCCA TCTCAAAAAA AAAAAGAAAA 
32601 AAGAAAAAAT CCTCACTGCT ACCTTGAAAG TAGGTGATGA CATTGCCATT 
32651 TCACAAATGA GAAGTGAAGG GGCTAGCCCA AGATCACTTA GGTGGTAAAT 
32701 GGTGGTGCTA AGATTAGAAC CTCAGATCAT CTAGGGAAAA ACACAGATAT 

32751 GCACAGAGTT AAGGGGACCC AGGGTATTGT TTGTCCTCTT GTTTCACAGG 
32801 TGGGGAAACA ACCCAGAGAG GGAAAGGGGC TTGTCCAAGG CAATTTAGCA 
32851 CCCAAGAACT TGAACCCATA TCTCTCTCCT CCTCATTTAG AGCTCATCCC 
32901 ACATGTATCT TATATTGAGA GGAGTGTGAG CCACATACCA AGAACAGTCT 
32951 TCCCCTCTGC CTCCAACCTC ACTGTGCAGT TTTGAGACAC TTCACAGCCA 
33001 TACTCTTCAT GCCATACCCA GCCCTTAAGA CCCTGAAGTT CCCCTTCCAT 
33051 AAGACAAGTA GGAAAAGCTA TAGGGTAAAA ATAGCCATCA GTGTTTGTTG 
33101 AGCACCCAGG AGGAATTGGG CACTCCAGAA AGATAAAGGG ATTCTCAGGG 
33151 ACTTGCTTCT CTAGACTTCC CTAGCTCAGC TGCTTCAACT CATTCCTGCC
33201 CCTCTTCTCT ACCTCCCGCA GTGCTCAGAA GTAGTAGAAC TCACTGTGGC 
33251 CTCTCACCTT GCATTGTTGA GTTTTATTTA GACTTTCTCT TCCTCAACTC
33301 TTCATAAGCT CATGAAAGGT GAAGTAGGGT GCCCTGTGTA TTTATCTTTT 
33351 ATATCTGCAG TGCTTAGCAA GTTATMTAA TGCACTTGCC TGGCAAAAGG 
33401 CTTTCTCTCA TACATTAGCT TATTTCCTCT TCACATTGGC TCTTTGTAGT 
33451 AATAGGATGC TATTAGTTAT TTTCAATGAG AGAAAGCTAC TAAGAGAAGT 
33501 TGTCCAGCTA GTGACAGTAA GTGGCTGATA AAGTGAGCTG CCATTACATT 
33551 GTCATCATCT TTAATAGAAG TTAACACATA CTGAGTTTCT ACTATATTGG 
33601 GTCTTTTTTT TTTTTTTTTT TTTTTTTTTA GAGACGGAAT CTTGCTCTGT
33651 TGTCCAGGCT GGAACGCAGT GGTGCAATTT TGGGTCACCA CAACCTCCGC 
33701 TTCCCAGGTT CAAGCGATTC TCCTGCCTCA GCCTCCTGAG TAGCTGGGAC 
33751 TACCAGTGCA CGCCACCACG CCCGGCTAAT TTTTGTATTT TTAGTAGAGA 
33801 CAGGGTTTCA CCATGTTGGC CAGGCTGGTC TTGAACTCCT GACCTTGTGA
33851 TCTGCCCGCC TCAGCCTCCC AAAGTGCTGG GATTACAGGT GTGAGCCACC 
33901 GCGCCCTGCC TATATTAGGA CTTTTATATA AGCTATCTCT AGCTAGCTAG 
33951 CTAGCTAGCT ATAATGTTTT TTGAGACAGA GTCTGACTCT GTCACCCAGG 
34001 CTGGAGTGCA GTGGCGTGAT CTCGACTCAC TGCAACCTCC ACCTCCTGGG 
34051 TTCCAGTGAT TCTCCTGCCT CAGCCTCCCG AGTAGCTGGG ATTATAGGTG
34101 CATGCCACCA CGCCCAGCTA ATTTTTTGTA TTTTTAGTAG ACCAGGTTTC 
34151 ACCATGTTGG CCAGGCTGGT CTCGAACTCC TGACTTCAAG TGATCCACCC 
34201 GCCTCGGCCT CCCAAAGTGC TGGGATTATA AGCATAAGCC ACTGTGCCCA 
34251 GCTGCTCTCT ATATTTTTAA TACATATTAT TTCCATTAAT TTTCACAGCA 
34301 GTTCATTTTA TAGATGAGGA AACTAGGCCA GAGAAGTAAA ATATCTTGCC 
34351 CAAGATGATG TAACTAGTAA GTGGCAGGAT CAAGATTCAA ACCAAGCAAT 
34401 GTTCAAACCT CTTGGAAGCA AGAATGTGGC CACTGTGGAA GGTGCAAGGC 
34451 CTTGACAACA AGAATAGGGA AAAGAAGGAA CTAGAAGGAA AGAGATGGCA 
34501 TGGGCTCAGC AGGCCAGGGA GCTCTTAGCT GTGTGTGTTG GGAAGCTCAG 
34551 AAGGGAGGAA GAGGTTGTCT GTGCAGGTAA GTCCTGAGAA CACACCAGAC 
34601 TTTTGAGAGG TGGAGCTTCA TAGCCAGGTC ATTAGGGGAG AAGGGAGCTA
34651 TAGATTTTTT TTTTTTTTTT TTTTTTTTTT TTTTTTTTAG AGACGGGGTC
34701 TTACTATGTT GCCCAGGCTG GTCTTGAACT CCTGGGCTCA AGTGATCCTC 
34751 CCACCTCAGC CTCCCAAAGT GCTGGGATTA GAGGCATCAG CCACCCCGCC 
34801 CAGCGAGCTA TGGATCTAAC ATGTACATCT TACACAGTGC TAATAGAATG 
34851 TTGGGTTTCT TCCCCAATAT TTTATTTTGA AAAAAAATTC AAATATATAG 
34901 AAAAGTTGAA AAATGTAGTT CAAAGAACAC CTACATACCT TTCACATAGA 
34951 TTCATGATTT GTTAATGTTA TGCCACTTTG TATATATCTC TCTCCCTCCT
35001 ATCTGTATAC TTTTATTTAT TTATTTTTGC TGAACTATTT CAGAGTAACT
35051 TAAAGGCATC TTGATTTTAC CCTTGAACAG TTCAATATGT TTCTGCTAAG 
35101 MTTCTCCTA TATAAGTCAG ATATCATTAC ATCTAAGAAA ATTCACGGCA 
35151 ATTTTACAAT ATAATATTAT AGTCCAAATC CATATTTCCT CAGTTGTTCC
35201 AAAAAATGTT CATGGCTGTT TCCTTTTTTA ATCTAAATTT GAATCCAAGT
35251 TTGAGGCATT GTATTTGGTT GCTGTGTCTC TAGGGTTTTT AAAATCTGTG 
35301 CCTTTTCTTC TCCCCATGAC TTTTTAGAAG AGTCAAGACC GGTTATTCTT 
35351 ATAGAATAAC CCACATTCTA GATTTGCCTG ATTAGTTTTT TTATACTTAA 
35401 CGTATTTTTG GCAAGAACAT TACATTGGTA ACGCTGTTGG TGATGGGTCA 
35451 GTTTTGAAGA GTGGAGATGA TTAAACTGCT TTTGTTCATT GAAGTATCTG 
35501 TCAAGACCAG AGATCCTTAA CTGGTGCCAT AAATAGGTTT CAGAGAATCC 
35551 TTTATATATA CACCCTGTCC CCCACCTAAA TTATATACAC ATCTTCTTTA 
35601 TATATTCATT TTTCTAGGGG AGGCTTCTTG GCTTTTATCA AATTCTCAGA 
35651 GGGCCCCAAG ACCCAAAGAG GTTATGAAAC ACTAGTCTGT CCACTGAGGC 
35701 AGGCAACACA GAGCTGGTTT CTGGGGCCTT GTTCAGTCTG AACCAGCTTC 
35751 CCTTGGGGAG ATAGCACAAG GCTGTAACTT TGCCCCATCT TGGCTTTGGA 
35801 TCAAAGAGGA CTGTCCATTT TGTTGTCATA CCTAGGAACC AGGGACAGCT 
35851 TATGTGGCCT GGTTCCAGGG ATCCAGGAGA ATTTCAGTTC TTGTCTTGCC 
35901 TTTCAGGTGT TCAGAATGCC AGGATTCCCT CACCAACTGG TACTATGAGA 
35951 AGGATGGGAA GCTCTACTGC CCCAAGGACT ACTGGGGGAA GTTTGGGGAG
36001 TTCTGTCATG GGTGCTCCCT GCTGATGACA GGGCCTTTTA TGGTGAGTGA 
36051 ATCCCTTCAT ATCTGCCCCT CTTGGTCTTC AGAGTCCATT GACAGTGCTT 
36101 CCAGTTCCCT GTGGCCTGTT AATCTTTTAG TCTTTCCATC AGCCAGGGCA
36151 TCTCCCTTTA TTTATTCATT CATTCAACTA GCAGGTATCA ATTGAGCACC 
36201 TACTAAGTGA AAGGTAAGAT CCTTCCCTCA AAGACTTAAT AGTTGAACGT 
36251 TGGGAGTGGG AGGAGAGGCA GGCAGAGAGG AGACACAATA TAGTTGGATA 
36301 AGGACCTCCA AGGAGAGTGT TACAGGCTGA GAGGAGGATA TACTTAGGTT 
36351 GTCTTTAGGG AATCAGAAAA GGAGACTCTG GAATAGGCTG GCAGAGAGAG 
36401 GGGCTACCTC CTATACCTGC TCTGGACAAA CGACTTTAAG CATAGTGACA
36451 GATTTGCCAA CCCTGTATTG GAAGAACTGA TCI I II MAG TGGGGATGAT 
36501 TACTTCTGGG GATTTCTTCT CATAACTGAG ACCAAAACAG TTTTGTGCAG 
36551 TCTCAGAAAT GACAGGAGGT ACCAATCTGA CACTTCCTTT GGAAGCTCTA 
36601 GGGCAGAGAG TGAAAGAGTG GATTTTGACG GGGGCCTTGC TTGGAGGTCA 
36651 TTCACCCACC CCTGTCCTCA CTCCAGCAAC AGTGATAACT CACTTCCTTC 
36701 CTCCCTTTGT ACACCCTTCT CCCCACCTGC TCACAGGTGG CTGGGGAGTT 
36751 CAAGTACCAC CCAGAGTGCT TTGCCTGTAT GAGCTGCAAG GTGATCATTG 
36801 AGGATGGGGA TGCATATGCA CTGGTGCAGC ATGCCACCCT CTACTGGTAA 
36851 GATAGTGGTC CTTTGTCTAT CCTCTCCCAT ATAAGAGTGG CTGGCGGGGA 
36901 GGGACAGTGG CAGGGTGAGT TGGGCAGAAG GAGTGTTAGG GTAGTCAGAG 
36951 CATTGGATTC TTACCACAGC AGTGCTCTTA ACCAGCTCTT TAACTTGTAA
37001 GCAGAATGAT TTACACATGT CTCTACCCTT TTTCCTTACC AACCTTGAAA 
37051 ATGTCTTCAC TCTGCCCTGC AATCCTCCCA GTGGGAGGCA CTCTTCAAGG 
37101 ACGATCCCAG AACATTAAAG TCAAAGACCC GTTAGAGCTC ACCCTGTCCA 
37151 ACCACCTTGG TTGATAAAAG AAGTCAGCCT GGGGCCCATG GAATAGAATA 
37201 GTACAAGGGC AAGGTTCTCA TTGTGAGTCA AAGGTAGAGT GAAGAGAACC
37251 CAGACCATCT CACCCCAACC CAGGCCAGTG TTTTTCCAAA TATACCACTT 
37301 GCTGCAGATC TAGCTCAGCA CCCCCAGTCC CAGCCCACCC TGAGAACCCA 
37351 GGCTCCTCAT TCTGAGCAGC CAGCTAGAAT CATGACAAAG AGGGTGGTAG 
37401 TGAGACTATG GGTACTGTTG CTTAAAGCCA CATGGTGCAG TGGTTGCTGG 
37451 GGGGCTTCTG TGTGGGACTC TAGCATCTTA TTCCCCCCTG TGCCCTCTCC 
37501 CCAGTGGGAA GTGCCACAAT GAGGTGGTGC TGGCACCCAT GTTTGAGAGA
37551 CTCTCCACAG AGTCTGTTCA GGAGCAGCTG CCCTACTCTG TCACGCTCAT 
37601 CTCCATGCCG GCCACCACTG AAGGCAGGCG GGGCTTCTCC GTGTCCGTGG 
37651 AGAGTGCCTG CTCCAACTAC GCCACCACTG TGCAAGTGAA AGAGTAAGTA 
37701 TTTTGAGAAC CCTTCAGCAG GGGTTCTTGA GCAGAGTCTG TAAATGGGCC 
37751 TCAGAGGGCT TAGACCTCCA AAGTCTCATG CAGAACTCCC TTTATTCTCA 
37801 TCTCATATCT TTCTCCTGGA CCCCACTATG CTGTAACCGT ACCTGGGCCT 
37851 TGGCACTTAC TGTTCTCTCT GCCCAGGCTA CTTCCTACCC GATACTTAAG 
37901 GCAAGAATCA CTCACCTTTC AGGTGTCAGG TTTCAGGTCA TGTTTGCTCT 
37951 TTGAAATCAT CTGGCTTGAT TATGTGTATT AGTTGTTTAT CTTCTATCCC 
38001 CTCCACTAGA ATGTAAATTC CAGAAGAAAC TTGCTGTCTT ATTCAGTGCT 
38051 GCATGCCCAG GGCTTGGAAG AGTACCTGGC ATATAGTAGG AGTTGATTGA 
38101 nATTATTTT GTCAGTCGAG AGAATGAATG GAGAAAATGT GGTCCATGGC 
38151 CCAAAAGAAG TTAAGACCCT ATCCTAGATT CAGGCCAGAG ACCAGATGGA 
38201 GAAAGAGTCT GTGTCTATCT AATACCAGTA ATGTCGTACC TCTGGCCGCT 
38251 TACCATGTAA ATATTGATTG TGTATCTACC ATGTGTTGGA CACTAGGCTA 
38301 GTGCTTGCAC AGCAGGTGAA AGATACTAGA GTTTGGGAAG TCAGGAGGAG 
38351 CTAAGGTCTG TTCTACAACC TTATTAGATG AAGAGGAGAG GGAATTGTGT 
38401 TCAGGGCAGA GGGAGAAGCA TTTCTCCAAA AGTAGGAGTC TTAATCATGT 
38451 CTGATGTAGG TTGAGTGTGG CCAGAAAAGG GGCTGTTAAG TATAGAGGGC 
38501 CTGGATTATG AAAATCCAGC AGATCCATTG AGAGTTTAAG CAGCAAGGTG 
38551 TTGTGACCAA GTTAACATTT TAGAAGGATC ACTGGTATGG AGGTTGGATT 
38601 GGAGAGGGGA AAGCCTAAAG GTATAGAGAC TAGTTAGGAA GCTATTGTAG 
38651 GCTGGGCATG GTGGTTCATG CCTGTAATCT CAGCACTTTG GGAGGCTGAG 
38701 GTGGGAGGAT TGCTTGAGGC CAGGAGTTGA AGACCAACCT GGCCAACATA 
38751 GCAAGACCCC GTCTCTGTTT TTCTTAATTA AAAGAAAAGT CCAGACGTAG 
38801 ACATAGTGGC TCACGCCTGT AATGCCAGCA CTTTGGGAGG CCAAGGTGGG 
38851 CAGATTGCTT GAGGTCAAGA GTTTGGGATT AGGCCAGGCG CAGTGGCTCA 
38901 CGCCTGTAAT CCCAGCACTT TGGGAGGCCG AGGTGGGCGG ATCACAAGGT 
38951 CAGGAGATCA AGACCATCCT GGCTAACACA ATGAAACCCC GTCTCTACTA 
39001 AAAGTACAAA AATTAGCCGG GCATGGTGGC GGACGCCTGT AGTCCCAGCT 
39051 ACTCGGGAGG CTGAGGCAGG AGAATGGCGT GAACCTAGGA GGCGGAGCTT 
39101 GCTGTGAGCA GAGATCACGC CACTGCACTC CAGCCTGAGC GACAGAGCGA 
39151 GACTCCATCT CAAAAAAAAA AAAGAGTTTG GGATTAGCCT GGCCAACATG 
39201 GCAAAACCCC ATCTCTACAA AAAGTACAAA AAAATTAGCT GGGTATGGTG 
39251 GTGCGCGCCT GTAATCCCAG TTACTCAGGA GGCTGAGGCA TGAGAATTGC 
39301 TTGAGCCTGG GAGGTGGAGG TTGCAGTGAG CCCAGATCAT GCCACTGCAC 
39351 TCCAGCCTGG ATGACAGAGT AAGATGCCAT CTCAAATAAA AATTAAAAAC 
39401 AAAGTTTAAA AAAAAAATAG AAGCTATTAC CGTGATCCAG GTAAGAGATG 
39451 TGAATAACTA CAATGATGGA AAGAAGGCAG AGTTCTTAGA GATGGGAGTA 
39501 GGAGAGATGA GGGAACTCCA GATTGGGAAG ATGATGTTCA AGTTTCTGGC 
39551 TTAGGCCACA GGGTGAGTGG CAATTCCCTT CACTGAGATG GGGCATCCTG 
39601 GAAAAGGTGT TGCCTTTCTG TGTGGGTATC CTGGGCCCCT TAGGGGCCAC 
39651 TGGTGGCCTG GGACCTGGTA AACCTTCCCT GCACAAGCAG AATTGGTCAA 
39701 GCAGGTTTTT AGGACATCTT TACCCTGCCT CAACTCTTGT CTGGCCCAGG 
39751 GTCAACCGGA TGCACATCAG TCCCAACAAT CGAAACGCCA TCCACCCTGG 
39801 GGACCGCATC CTGGAGATCA ATGGGACCCC CGTCCGCACA CTTCGAGTGG 
39851 AGGAGGTAGA GTGTGTGTCT AATCTGTCTT GTGAGGGTGG GACATGGAAC 
39901 AGATCCTCTG GGAAATCAGG CTGTAGCCTT TACCTTTTCC TACCCCCAGC 
39951 CCATCTCTTT GTCTTAGCAT TGAGCCTGTG ACCACTGGTG ACCTATTTCA 
40001 GCGTAACAGG TTCCCAGGGT AGCAGGGATG GTTGATGGAC GGGAGAGCTG
40051 ACAGGATGCC AGGCAGAGGG CACTGTGAGG CCACTGGCAG CTAAAGGCCA 
40101 CCATTAGACA AGTTGAGCAC TGGCCACACT GTGCCTGAGT CATCTGGGTT 
40151 GGCCATGGGT GGCCTGGGAT GGGGCAGCCT GTGGGAGCTT TATACTGCTC 
40201 TTGGCCACAG GTGGAGGATG CAATTAGCCA GACGAGCCAG ACACTTCAGC
40251 TGTTGATTGA ACATGACCCC GTCTCCCAAC GCCTGGACCA GCTGCGGCTG 
40301 GAGGCCCGGC TCGCTCCTCA CATGCAGAAT GCCGGACACC CCCACGCCCT
40351 CAGCACCCTG GACACCAAGG AGAATCTGGA GGGGACACTG AGGAGACGTT 
40401 CCCTAAGGTG CCACCTCCCA CCCTGGCTCT GTTCTGTCCT ATGTCTGTCT 
40451 CTCGGATGAA GCTGAGCTGG CTTTCAGAAG CCTGCAGAGT TAGGAAAGGA 
40501 ACCAGCTGGC CAGGGACAGA CTATGAGGAT TGTGCTGACC CAGCTGCCCC 
40551 TGTGGGGATC ACAGTTTACA GCCAGAGCCT GTGCGGACCC AGCTGTCTGC 
40601 CAGGTTTCCT TAGAAACCTG AGAGTCAGTC TCTGTCCACT GAACTCCTAA 
40651 GCTGGACAGG AGGCAGTGAT GCTAAACCCT GAAGGGCAAC ATGGCCTATG 
40701 GAGAAAGCAT GGAGCTCAGA GCCTGGAGTA CGGGCACAGA TAGGATTGAA 
40751 TAAATTGTGT AGAAAGACTT TGAAAACAAT AAAGCAAAAG ATGAATGAAC
40801 Gill II MIA GACTTGAGGG ACCAACAACC CCCAAACCCC AGATTCTGCC 
40851 AGGTCCATGG GGAAGGAGAA GTTGCCTTGA GTGGAAGCCC CAAGTAGGGA 
40901 GACTTACAGA AAAGAAGTCA AGAGCACTGG CTCCCAGGCA GAAATACTGA 
40951 TACCCTACTG GGGCTTCAGG CTGAGCTCCT CCQTTCACAA ATCACTTCAT 
41001 CTCTCTGAGC CTGTTTCTGC ATCTGTGACA TAAGATGGTA AGATAAAGGT 
41051 GGCTGTCTCA CCAATTATGT AAGGATTAAA TGTGGAAAAG GACATAAAGT 
41101 TGTATAGTGC TGCCATAGGG ACAGTGTTCA GTAAACGTGA CACATTCTTA 
41151 GTATCACTAA GAATCAGGTT CTTGGCCAGG CACCGTGGCT CATGCCTGTA 
41201 ATCCCAACAC TCTGGGAGGC CTAGGTCGGA GGATGGCTTG AACACAGGAG

41251 TTTGAGACCA GCCTGAGCAA CATAGTGAGA CACTGTCTCT ACAAAAAAAA 

41301 AATAATAATA ATAATTGTTT TTAATTAGAT GGGGAGGGCA CTGTGGCTCA 
41351 CACCTGTAAT CCCAGCACTT TGGGAGGCCA AGGCCGGAGG ATTGCTTGAG 
41401 GCCAGGAGTT CAGGAGCAGC CTGGGCCACA TTCCTGTCTC TACAAAGAAT 
41451 AAAAAAGTTA ACTGGGCATG GTGGCACATG CCTGTAATCC CAGCTACTCA 
41501 AGAGGCTGAG GAGGAGGATT GCCTGAGCCC AGGAGTTCAA GACTGCAGTG 
41551 AGCCTTGATC ACACCACTGT ACTACAGCTT GGGCAACAGA GTGAGACCTT
41601 GTCTCCAAAA AAAAAAGTTT GTTTTTTTTT ATCCACTCTC CTCACCAAAC
41651 AAACTGAGTA AGTTAGAGCC CTCTCAGCTG GCATGTGTTG GAAACAGTGC 
41701 CCTCTCATTA AAGTGCTGCC CTCACTCCCA TTGCCTCTTG GCCTTGGTCA 
41751 GTATGATGAA ATTAGTGGGA GGCAGGGCAA CAGAGGGCAG GGAAGAGCTA 
41801 GAAATCCATG GCCTGGAAAA GGGAAGATTT GGGAGTGGCC AGGTATCTGT 
41851 AGAGCCACCA TGCAGAGGAG GGGGGCAGCT AGCCTTGTGT GCTCTGGTGG 
41901 GCATGGTCAG CAGGAGGCAG AGCAAAAGGA CAAGGGTAAG TAAACCTGTA 
41951 GGTCGGGACA AGCCAAGAGC CATCCAGCGT CAGTCCTCTC TGGGTAGCCC 
42001 AAGTAAAGCA GGAGCATACC CCAGAGAGAA AGTTCGCAGG GCTGTTCACC 
42051 TGCAGTGCTG TGGACTTCAA CCTTCTTGTT CCTTCTTCAG TAAGTGAAAA 
42101 TAACAGTCAT TGACCATGAC TATTATCGAC CGCTTTTGAA AATGTAAACA 
42151 TAGTGACTTT ATTGCTGTAA AAATCATACG TGTTTATCAT CTTAAAATTC 
42201 AGGAAACATG GACAGGTACA AAGATGTGCA AAATATCATC CAAAATCCCA 
42251 TTTGCTGGCC AGGCACGGTG GCTCACGCCT GTAATCCCAG CACATTGGGA 
42301 GGCCGAGGCG GGCAAATCAC TTGAGGTCAG GAGTTTGAGA CCAGCCTGGC
42351 CAACATGGTG AAACCCTATC TCTACTAAAA ATACAATAAT TAGGCTGGGC 
42401 GCAGTGGCTC ACGCCTATAA TCCCAGCACT TTGGGAGGCC GAGGTGGGCG 
42451 AATCACAAGG TCAGGAGTTT GAGACTAGCC TGGCCAATAT GGTGAAACCC 
42501 CATCTCTACT AAAAATACAA AAATTAGGGC CGGGTGTGGT GGCTCACGCC

42551 TGTAATCCCA GCACTTAGGG AGGCCGAGAC AGATGGATCG CGAGATCAGG 

42601 AGTTCGAGAC CAACCTAGCC AACATGGTGA AACCCCATCT CTACTAAAAA 
42651 AATACAAAM TTATTCGGTT GTGGTGGCAC ACGCCTGTAA TCCCAGCTAC 
42701 TTGGGAGGCT GAGGCAGGAG AATCTCTTGA ACCTGGGAGG CAGAGGTTGC 
42751 AGTGAGTGGA GATCCCGCCG TTGCACTCCA GCCTGGGCGA CAGAGTGAGA 
42801 CTCCATCAAA AAAAAAAAAA AAAAAAAAAA AAATTAGCCG GGCGTGGTGG 
42851 CGTGCACCTA TACTCCCAGC TACTTGGGAG GCTGAGGCAG GAGAATCGCT 
42901 TGAACCTGGA AGGCGGAGGT CGCAGTGAGC CGAGATCGTG CCATTGCACT 
42951 TCAGCCTGGG CGACAGAGCG AGACTCTGTC TCAAAAATAA TAATAATAAC 
43001 AATAACTAGC CGGGCCTGGT GGCACATGCC TGTAGTCCCA GTTACTCAGG 
43051 AGGCGGAGGC ATGAGACTCA GGTGAACTAG GGAGACAGAG GTTGCAGTGA 
43101 GCCAAGATCA CACCACTGCA CTCCAGCCTG GTTGACAGAG CGAGACTCTG
43151 TCTCAAAAAA AAAAAAATCC CATTTGCTCA TTTTTTGGAT ACTAGTATAA
43201 CTATCACTCT AAACCAGTTA GTACTTAAAT CAAGCAGATA TGGGAGATGG 
43251 TGAATTACCA TCTACAGTGT TGTCATATAT GTCACATACT GAGCATTATC 
43301 AGCTAGTAGA ATCTAGTTAA TTGTTCTATG TGTGATGTAT GCAGAGTTCC 
43351 CATTTTGAAT GTGTTTTTAC TATGCTTAAA TAAATGACTG ATGTCAGCAA 
43401 CCCCAAAATG ATACATCTGA TGTAAGAGCC CCTGTTCCCC AATAATAACA 
43451 TCTAAACTAT AGACATTGGA ATGAACAGGT GCGCCTAAGT TTCCTCCCTC 
43501 CAGGGTTTCT TGGCCGGTCT CTGAGGACTA CACATCCCTA CTCCCGTCTT 
43551 TCCTCATCTT CAGGCGCAGT AACAGTATCT CCAAGTCCCC TGGCCCCAGC 
43601 TCCCCAAAGG AGCCCCTGCT GTTCAGCCGT GACATCAGCC GCTCAGAATC 
43651 CCTTCGTTGT TCCAGCAGCT ATTCACAGCA GATCTTCCGG CCCTGTGACC 
43701 TAATCCATGG GGAGGTCCTG GGGAAGGGCT TCTTTGGGCA GGCTATCAAG
43751 GTGAGCGCAG GCAACAATTG CTTTGCTCTT CTGCCCCCAG TCCCTCTGTC 
43801 ACTGTCTTTC GGGGATTTCT CATCACTTGG CCCCACCCCA CACCATGCAG 
43851 GATGCCAGGC CTCCTTCCTG GCTTTGGGTG TTGGTGTGAG AGGTATCCTT 
43901 CACCCCCACC CAGGCCACCT AAGGTCAATG TTGCTGTTAC AGTGAGCTTG 
43951 TGGACCTGGA GATCCAGGTT GGGTTGAGCT GTGCCTGTGG CCCTCCTGCC 
44001 TGCAGTCAGT GGGTGTTTGT TAGGTGCCTG CAGACCTCAG TACCGGGCAT 
44051 GCTACAAGGA GCACACAGGG GAATGGCTCC TGCCTCCCTG GTGAACAGTC 
44101 TCAGGGACTA ACCTCTCTCT TTCTCTCCTC CTCCTCCTCT TCTGCTGAGA 
44151 ACTGGGAGGG GGGGTCAGGT AAGACGTGTG TCTCAGCTTG GGGGCAGCAG 
44201 GGCTGGAGAG CTCACCCCCG ATCCACCCAG CTCCCTGGTG CATGTCTTTG 
44251 GCACTGACCT TCCTGCCCCC AGACTTCTGT TCACTCAGGA GACTCACTTC 
44301 TATGCCAAAT GACCAGAGCC CCTGCTTGGC TTGGCAGCAT CCCCTCCTGC 
44351 CnCHCCCC ACTTCCCTTT TCTGGGTTCT TGCCTGTCCT CTGTGCATGC 
44401 CCAGCTCTCC AGGAAAGAGG GTTTGCTTCC GTGTGAGTCC CATGTTGCTC 
44451 CACGCTGCAT CTTCCACACA TGAACTCTGT CATTCTGACC CGGCTCAGTG 
44501 TGCCCTCCAA GGGATGGGAT GGCCAGCTGC ATAGATTTTC TCAAACAGTT 
44551 CTCCAGAACT TCCTCTGGTC TCAGCACCAT TAACAGTCAC CCTCCCTGTA 
44601 GGTGACACAC AAAGCCACGG GCAAAGTGAT GGTCATGAAA GAGTTAATTC 
44651 GATGTGATGA GGAGACCCAG AAAACTTTTC TGACTGAGGT AAGAAGATGG 
44701 AGGGGGCCCG GGAGGTTGGT GTCACCATTG GAAGAGAGAA GACCTTACAA 
44751 ATAATGGCTT CAAGAGAAAA TACAGTTTGG AATTACTGTC TTAAAGACTA 
44801 AGCAGAAAAG AGCCCTAGAG GAATATCCCA CTCCCTCTAA ATTACAGCGT 
44851 AATTATTTGT TCAATGAACA CTTACTAAAA GCAACACAAA CAGGGTACAA 
44901 GGGATGCAGT AACAAAAGAT ACAGGGTTCA GAAGAGCTCT CAGGTTATGA 
44951 GGATGATGGA CATGAAAACA CTCCAATTTA GTACAACTCA ATGTTATAAT 
45001 CCTCACCTGA ACGCCCTGCT AAGGGAGCCT GGAGGGGAGC TCCCTGAGCA
45051 CTCACACTCC TTGGGCATTT ACAGTTTTCA CTACCCCTCC CAAGTTACTT 
45101 CATGGAGTAA CTTAAGTTGG GGACACCTGT GGTCTGGGTA TTGCCCTCCA
45151 AGCCACTTGG CCACTCCCAC CCCAGTTCTC CCAATGCAGT TCCAAGGGTA 
45201 AGGCCTATGA AGCCATCTCC ATCTATATGG TGGTGGTCTT CCCTCATCCT 
45251 GATCTTAGTG CCCTGTCATA TCACAAGATA GGAGGTAGGA GATACAGGTG 
45301 GTAACACTTG TCAAGCTGAT TCCTTGGAGG GAAGAGGTAA GGAAGACAGT 
45351 GAGAAGTTM CCACCAGCTT TCCTTGGCTT CCCCCACCCC CAGGTGAAAG 
45401 TGATGCGCAG CCTGGACCAC CCCAATGTGC TCAAGTTCAT TGGTGTGCTG 
45451 TACAAGGATA AGAAGCTGAA CCTGCTGACA GAGTACATTG AGGGGGGCAC 
45501 ACTGAAGGAC TTTCTGCGCA GTATGGTGAG CACACCACCC CATAGTCTCC 
45551 AGGAGCCTTG GTGGGTTGTC AGACACCTAT GCTATCACTA CCCTAGGAGC 
45601 TTAAAGGGCA GAGGGGCCCT GCTTTGCCTC CAAAGGACCA TGCTGGGTGG 
45651 GACTGAGCAT ACATAGGGAG GCTTCACTGG GAGACCACAT TGACCCATGG 
45701 GGCCTGGACC ACGAGTGGGA CAGGGCTCAA CAGCCTCTGA AAATCATTCC 
45751 CCATTCTGCA GGATCCGTTC CCCTGGCAGC AGAAGGTCAG GTTTGCCAAA 
45801 GGAATCGCCT CCGGAATGGT GAGTCCCACC AACAAACCTG CCAGCAGGGC 
45851 GAGAGTAGGG AGAGGTGTGA GAATTGTGGG CTTCACTGGA AGGTAGAGAC 
45901 CCCTTCCTAT GCAACTTGTG TGGGCTGGGT CAGCAGCTAT TCATTGAGTT 
45951 TGTCTGTGTC ACTGAAACTG ACCCCAGCCA ACTGTTCTCA GTTCACAGCC 
46001 CTGTTTTCAA AGAATTACAC ATCTCTAAAG GCAAACAGGG CACGGACAAG 
46051 GCAAACTGGA GAGGCAAACT GTAGCCTGAG ATGGCCTGGG CTTGCCATCA 
46101 CAGGTATTCA GGTGCTGAGG GCCCTTAGAC CAACTAGAGC ACCTCACTGC
46151 CTAGGAAATC AATGAAGGGG AAATGAGTTC TAGCGGAGCC CTGAAGGATC 
46201 AGAATTGGAT AAAGTTCTTA TTGGCAGAGA GGCACCAGGA TTGAAGTGAC 
46251 AGGAGCAAAG ACCTGGGAGG AAAGAGGAGA AAATCATCTA TTTCACCTGG
46301 AAACAAATGA TTCCAAGCAT AGAAATAATA ACAGCTGACA AGTACTGAGT 
46351 GCCCTCTATA TGCTAGGCAC TGGGCTGAGG GATTAACATG CATGTGCATG 
46401 TTTATTCCTC ATGACAACCT TGGTTTCCAG ATAAGCTGGA CTGGAAAGGG 
46451 ACAGAGCTGG GATCCTGGGC TAATCAGTCT GGTCGCCMG CCTGAGACTT 
46501 TAGCCACTGC CCTTCACATG GGGGTCCATG AAAATAGTAG TAGTCTGGAA 
46551 CAGTTTGGGG GTACATCAAG GTCGCTGTGT TTTAAGCTAT GGAGTCTGGA
46601 CTATAGGAGA CAAATGTAAA AGAGTTTTTT GGTTGACTGG CTTTTTGGTT
46651 TTTTTGTTTG TTTGTTTGTT TGTTTGTTTG TTTGTTTGTT TTTTCCTGTT
46701 TCTGGGGCTT GAATCAGGAA GGAGGTTTTT TTGTTGTTGT TGTTTTGAGA 
46751 AAGGATATTG CTCTGTTGCC CAGACTGGAG TGCAGTGGCA CGATCATGGC
46801 TCACTACAGC TTCGACCTCC TGGGCTCAAG CAATCCTCCT GCCTTAGCCT
46851 CCCAAGTAGC TGGACTACAG GTGTGTACCA CCACACCTAA TTTTTTGAAT
46901 TTTTTTTTCT llllllllll 1 1 II 1 1 1 M I GGTAGAGACA GGTTCTCACT 
46951 TTGTTGCCCA GGCCTGAATC TCAAACTCCT GGGCTCAAGC ATTCCTCCTG 
47001 CCTCGCCCTC CCAAAGTGTT GGGATTACAG TTGTGAGCCA CCATGCCCGG
47051 CAGGAAAAGA TTTTTAAGCA AGAAAGCTTA AGAGCTGTGG TTTTTCCAAA
47101 ATGAGTCTGG GCTGGCACAG TGGCTCATGC CTGTAATCCC AGCACTTTTT 
47151 TGGGAGGCCG AGGTGAGTGG ATCACTTGAG GTCAGGAGTT TGAGACCAGC 
47201 CTGGCCAACT GGTGAAACCC CTGTTTCTAC TAAAGAAAAA AATGCAAAAA 
47251 TTAGCTGGGC GTGGTGGTGC ACGCCTGTAG TCCCAGCTAC TCAGGAGGCC 
47301 GAGGCAGGAG AATAGCTTGA ACCTGGGAGG CAGAAGTTGC AGTGAGCCAA 
47351 GATCACACCA CTGCATTCCA GCCTGGGTGA CAGAGTGAGA CTTCATCTCA 
47401 AAAAAAAAAA AAAAGAGAGA CTGATATGGT TAGTACATTG GGGTGGAATG 
47451 CGGAGGGTCC AGGGAATGGA GCCCTGCATA GGGGGCTAAT GAAACATTTC 
47501 AGATTTCTGA ATTMGGTAG TGGCTGTGGG GACAGGAGCC TGGGAGGCAG
47551 GGTGGAGTCA GAATGGAGAG ACTGGTTGGC AATGAGGGAA CAGGAGGAGG
47601 AGGAGGAGGA GTTACGAGTG GCTTGAGGTG TCACTTACCA GACATTTGGG 
47651 GGATGGGGGA TAGCCGTGAT TGTTGAGCAA CTGGTTTGGG AAGAGCTAGC 
47701 ATTGATCCCT GCTGTTCTGT GCTAGCAGAA CCTATCAGCA TCTTCTGGGC 
47751 AGGAAACTGG CTCCATGAGA CTGGCTTAGG GAGAGGCTGC TAGTCACCTA
47801 ATCTGCAGAG AAGGGGCAGC TGGAGCTGTG GGACAGAAGA GGCATCCATG 
47851 TAGCTGGTGG GGGTGTCTCA GCTTGTGAAG AGGAGATGGC TTTGAGCAGG 
47901 GCTGACACTG AAAAGGCTGG AAGAAAAAAA CAGACACACA AGAGTCTCAG 
47951 GATCAGGTAG CATAGGAAAG TTGTGGACAG TCTTTGAGGA GCACTCCCTC 
48001 AGGCAGGCAG GCAGGCAGGT CATGAGCTAT AGCGATTCAG GAAGAGCTCC 
48051 CTGGGTGTGT GAGCAGCTCC AGGAGCCTAA GGGATGAAAG TAGTATTGCA 
48101 GGGGGCTGGA GAGCAAGGAG TGGCTCCTTC TACATTTGCA AGGGAAGGAG 
48151 AAAGGMGTT GCTCCTGAGA GTGGTAAGAG TCAGTGGTGG AGGCCTGGAG 
48201 AGGAGACATA ACAAACAAAT TTGTTGACAA ACATTTTGGT AGGAAGGGGG 
48251 AGAGCTTAAA GTTTAGACAG TGGGGAAGGT GGAGTCTTAG AGGAGGTGAA 
48301 TGTCTGAAAG ACAGAGCTAG CTGGAGCAAG AAGTCACTTC TCTGTTGCAG 
48351 GCAGGAAGGA TCCAAAGTGG CTCAAGCCAG AGATTGGGAG AGTGGGGAGG 
48401 AGGGAGCAGC CTGGATCTAA GTAAAATGGG TAGAGGTGGA GGGGGTGCTG 
48451 CAACGGCCAG GGTTTTCTGA AGTTGGGGAC ATTAGGAGAG AGCTGTGAGG
48501 GCTTTGGCCA GCCACTGTGC TAGTGATTGG TGAACCAAAG GATGGGCAGG 
48551 AGATGGCAGC AGGGAAGCAG AGGAAGTCCA GGCTTCCTGT TGGTATTGGG 
48601 ACAAGGGAGA GGCCATAGGA GGCCCTGGCC CTGTTGTCCA GGTTGGGTTC 
48651 TGAAGCTGGG TGGGCATGGC CTGGTAGGAG AGCATCTATG GCGCCCAATT 
48701 CCAGATTCAG GGTCTAGTTG ATTTGCTGGC CCTGTAGCCT CAGCTCATGC 
48751 TTCTGTTCCA GGCCTATTTG CACTGTATGT GCATCATCCA CCGGGATCTG 
48801 AACTCGCACA ACTGCCTCAT CAAGTTGGTA TGTCCCACTG CTCTGGGCCT 
48851 GGCCTCCAGG GTCCTATCCT TCCTGGCTTC CTTGTCACAA AGGAGGCTGA 
48901 CTTGTCCCCT CTGGCTAGAG GGCAGAGGTG TTGCCTAGGA GCTCCTATCT 

48951 TTCCCTTCCT GCTTCTTCCA ATGCCCTTCT CTGTCCTCTG GGAGCTCCGA 

49001 GACACACACA GACATAATTT CACCTTCTCT CATTAGCAAC CTTTGAAATA

49051 ATTTGATTAG AAGGGACTTC AGAAGTTTGT TGACTATATG TAGAAAACCC 

49101 TGTCATTTTA CCTGCTTTTG CCCCATAGTA GTCTTGTAAA ACAGTTCATT 

49151 GCTGACCCCA TTTTACAGTG GTGGCACCTG AAGCCTCAGC CTGAGGCCAC 

49201 CGAGCTAGTA AATTTACAGG GACCAGTTTG AGACCAGCAT TCCTCCCACT 
49251 GCCCCTCAGC TGTGGTGGTT ACAATGTTGT TTGTCTTACT GACTTGCTAT 
49301 CTGGCTTCCT GGGTGTCTAC CGGCTGGCCC TGGCTCTGCC CTCTAGACCC 
49351 ACACCACGCA ATCTTCATTC CTTTCCCACA TGACTGCCCT GTAGCTATTC 
49401 AAAGAGCTTG TCTCCCCCAA GTCTCCCCAT CTACTGGCTC CACCTTGCCT 
49451 TTTTCTGTCT TATCCTGGTT CTAGCCACTG CCTGAAATCA TTTTAGGAAT 
49501 AAGACAGGAC AGGGAAAAAC AAAAGCAACC CCCTGTCCCA CCTCTGAGTT
49551 CCACTCTCCA AGTCCCTGAG CCTCACCTCC AGGGCTCCAG TGGCTCTGCC 
49601 ATGAACCCAC TGTGGGCTGG GAGTCTGCTG TGCACAGATA CCAGACCCTC 
49651 AGAAACACAA ATGCCAAGTG TGTCTGTTTT TTTGTTTTGT TTTGTTTTGT 
49701 TTTTTAGATG GAGTCTCATT CTGTTTCCCA GGCTGGAGTG CAGTGGTGCA 
49751 ATCTTGGCTT ACTGCAGCCT CTACCTCCCG GGTTCTAGTG ATTGTTCTGC 
49801 TTCAGCCTCC CAGTAGCTAG GACTACAGGC GTGTGCCACC ACGCCCAGCT
49851 AATTTTTTTT TTTTTTTTTT TGTATTTTTA GTAGAGACAG GGTTTTGCCA
49901 TGTTGGCCAG GCTGGTCTTG AACTCCTGAC CTCAGGTGAT TCACCCGCCT 
49951 TGGCCTCCCA AAGTTCTGGG ATTACAGGTG GAAGCCACCG TGCCTGGCCT 
50001 GAGTGTGTCT ATTTGATAGA GCTTTCTGCT CTGATTCTCC CTTGCTATAC
50051 ACCTTTTCTC CCCTTCTCAG TGGCTTCTCT TGCCTATGCT TCCTCCCCAG 

50101 GGCCAGGTTT GAGAACATCC CCATGAAGTC CTGACCTGTC TTTTATCCTA 
50151 CCAGGACAAG ACTGTGGTGG TGGCAGACTT TGGGCTGTCA CGGCTCATAG

50201 TGGAAGAGAG GAAAAGGGCC CCCATGGAGA AGGCCACCAC CAAGAAACGC 
50251 ACCTTGCGCA AGAACGACCG CAAGAAGCGC TACACGGTGG TGGGAAACCC 

50301 CTACTGGATG GCCCCTGAGA TGCTGAACGG TGAGTCCTGA AGCCCTGGAG 
50351 GGGACACCCG CAGAGGGAGG ACAGATGCTG CCCTTGCATC AGAGCCCTGG
50401 GAATTCCAGG GGAGGCCTGT GAAGCGTAGG ACCGGATACC CAGAGCTGAG 
50451 GATATTTTTC CCTTGCCAGG TGGGGCCTCA CGATTTAGCT CCTGAGCTCA
50501 GGGGGCTGGG AACTGATCAG TGTCCCATCA TGGGGGATAA GGTGAGTTCT 
50551 GACTGTGGCA TTTGTGCCTC AGGGATCGCT AAGAGCTCAG GCTATTGTCC 
50601 CAGCTTTAGC CTTCTCTCTC CATGGTGAGA ACTGAAGTGT GGTGCCCTCT 
50651 GGTGGATAAT GCTCAAACCA ACCAGAGATG CTGGTTGGGA TTCTTGAAAT 
50701 CAGGGTTGTG AGGCCTCAGA AATGGTCTGA ATACAATCCA TTTTGGAGTC 
50751 TGAGGCCCAG AGAAGTTCAG TGAATTGCCT AGGAGCATAC AGCTGCCTAA 
50801 TGGCAGAGGC TAGATGAACC CTAGTCTGGT TCTTTTCCAC TTTAACGTGC 
50851 AGTTTCATCC TAGGCAGTGT TATGTTATAA GGGCTCTCCA AGGCAGTTCA 
50901 CCTACGGCTG AGGAAGGACT ATTTTCAGGT GGTGTCTGCG CAGGACAGCC 
50951 TGTGGGGTGT CCCTACAGAA CCTGTTCTAG CCCJAG1TCT TAGCTGTGGC 
51001 TTAGATTGAC CCTAGACCCA GTGCAGAGCA GGTAAGGGAT GTAAACTTAA 
51051 CAGTGTGCTC TCCTGTGTTC CCCAAGGAAA GAGCTATGAT GAGACGGTGG 
51101 ATATCTTCTC CTTTGGGATC GTTCTCTGTG AGGTGAGCTC TGGCACCAAG 
51151 GCCATGCCCG AGGCAGCAGG CCTAGCAGCT CTGCCTTCCC TCGGAACTGG 
51201 GGCATCTCCT CCTAGGGATG ACTAGCTTGA CTAAAATCAA CATGGGTGTA 
51251 GGGTTTTATG GTTTATAACG CATCTGCACA TCTTTGCCAC GTTCGTGTTT
51301 CATTGGTCTT AAGAGAAGGA CTGGCAGGGT TTTTTTGTTT TAGATGGAGC
51351 CTCACTTCGT TGCCCAGGCT GGAGTGCAGT GGCACAATCT GGGCTCACTG 
51401 CAACCTCTGC CTTCTGGGTT CAAGTGATTC TCCTGCCTCA GCCTCCCAAG 
51451 TAGCTGGGAC TACCGGCACA CACCACCATG CCCGGCTAAT TTTTGTATTT 
51501 TTAGTAGAGA CAGGGTTTCA CCATGTTGGC CAGGCTGGTC TTGAACTCCG 
51551 GACCTCAGGT GATCCGCCTG CCTCAGCCTC TAAAAGTGCT GGAATTAATA
51601 GGCGTGAGCT ACCTCGCCCG GCCAGGTTTT TTTTTTTTTT TTTTTAGTTG
51651 AGGAAACTGA GGCTTGGAAG AGGGCAGTGG CTTGCACATG GTCGATAAGG 
51701 GGCAGATGAG ACTCAGAATT CCAGAAGGAA GGGCAAGAGA CTGTTCATGT 
51751 GGCTGTCTAG CTAGCTCTTG GGCCAAATGT AGCCCTTCTC AGTTCCCTTC 
51801 AAGTAGAAGT AGCCACTCTA GGAAGTGTCA GCCCTGTGCC AGGTACCACG 
51851 TGGACAGAGT GAGGAATCTT GGAAAGATTC CTACCTTTAG GAGTTTAGTC 
51901 AGGTGACAGC ATATCTCAGC GACTCAAACA CACACACATT CAAAGCCTTC 
51951 TGTAATTCCT ACAAAGTTGT GAGGGGTAGA GGAGAGGAGA GACAAGGGAT
52001 GGTTAGGATA ATGAAGGAAT GTTTTGTTTT TGTTTTTGTT TTTGAGATGG 
52051 AGTTTCACTC TGTCACCCAG GCTGGAGTGC AGAGGTGCAA TCTTGGCTCA 
52101 CTGCAGCCTC CGCCTCCCAG GTTCAAGCAA TCCTCCTGCC TCAGCCTCCC
52151 AAGTAGCTGG GACTACAGGT GTGCGCCACC ACGCCTGGCT AATTTTTGTA
52201 TTTTCAGTAG AGACAGGGTT TCGCCATATT GGCCAGGCTG GTCTCAAATG 
52251 CCTGACCTCA GGTGATACAC CCGCTTCAGC CTCCCAAAGT GCTGAGATTA 
52301 CAGGCATGAG CTACCGTGCC TGGCCATGAA GGAAGATTTG TTTTAAAAAA 
52351 TTGTTTTCTT TAATATTAAT TGAACACCTC TGTTCAGAGC ACTGGGCTGG 
52401 TGCCAGAGGG TTTCAGACAT GAATCAGATC CAGCACCTCA TAGAGCCTTA 
52451 ATCTGGCACA CACACACAGC CACAAGGAGA CACAGACAAG GCAGGGTAGG 
52501 ATGAGTGGAA GCTAGGAGCA GATGCTGATT TGGAACACTT GGCTTCTGCA
52551 GTGAAGCCCC TTCTTAGTCC TCTTCAGTAA CCCAGCTCTC AGTGGATACA 
52601 GGTCTGGATT AGTAAGATTT GGAGAGATGA TTGGGGATTG GGGAGAGCTC
52651 TCTAACCTAT TTTACCACCT CCTCTTCTGC CATTCTTCCT GTCCACATCC 
52701 CCAGCATCCC TTTCCCTTGC CAAGTATCTG TGGCCTCTGT AGTCCTTTGT
52751 AAACAGCTGT CTTCTTACCC TACAGATCAT TGGGCAGGTG TATGCAGATC 
52801 CTGACTGCCT TCCCCGAACA CTGGACTTTG GCCTCAACGT GAAGCTTTTC 
52851 TGGGAGAAGT TTGTTCCCAC AGATTGTCCC CCGGCCTTCT TCCCGCTGGC 
52901 CGCCATCTGC TGCAGACTGG AGCCTGAGAG CAGGTTGGTA TCCTGCCTTT 
52951 TTCTCCCAGC TCACAGGGTC CTGGGACGTT TGCCTCTGTC TAAGGCCACC 
53001 CCTGAGCCCT CTGCAAGCAC AGGGGTGAGA GAAGCCTTGA GGTCAAGAAT 
53051 GTGGCTGTCA ACCCCTGAGC CATCTGACAA CACATATGTA CAGGTTGGAG 
53101 AAGAGAGAGG TAAAGACATA GCAGCAAGTA ATCTGGATAG GACACAGAAA 
53151 CACAGCCATT AAAAGAAAGT TTAAAAGAAG GAAATTCACC CAAACCATTT 
53201 GAATACAGTA AGTGTATTCA TCTTTCGATA TTCCCCTGTC CATATCTACA 
53251 CATATACTTT TTTTTATAGT AAATAGTTCT GTATTTTGCC CTGCATTTCC
53301 CTTGTGTTTA CTATCCAGTC TTCCTGTTTA TCATTTTTGT CGACAACATG 
53351 AAATTCTATT GAGAGACTGT CTGAACATAT TGTAATGTAG ATGTTCAGGT 
53401 TTTTCCAGTT TCTCTTTACA ATAGGTATTT AACTACAGTG AGCAGTTTTA 
53451 TGCATTTAGC TAATTTCTCC TTTGAGGAAG TATTTTCAAA ATTACCTTTA 
53501 TTCTTCTCAG GTAATAATTT CATTATTACC AAAGTTACCC TAGGTCTTTT 
53551 CAAGTGTGTG GTTAAAAAAC GAGAATCTGG CTGGGCGCGA TGGCTCACAC 
53601 CTGTAATCCC AGCACTTTGG GAGGCTGAGG CTGGTGGATC ACCTGAGGTC 
53651 TGGAGTTCGA GACCAGCCTG GCCAACATGG TGAAACCCCA TCTCTACTAA 
53701 AAATACAAAA CTTAGCCAGG CATGGTGGCA GGTGCCTGTA ACCCCAGCTA 
53751 CTTGGGAGGC TGAGGCAGGA GAATTGCTTG AACCCAGGGG CGGAGGTTGC 
53801 AGTGAGCCGA TATCACGCCA TTGCACTCCA GCCTCGGCAA CAAGAGTGAA 
53851 ACTCTGTCTC AAAAATGGGG TTCTTTTCCT GCCATCAAAA ATCATGTTTC 
53901 TTTTAAAAAC AAGTTCAAAC ATTACCAAAG TTTATAGCAC AGGAAATACG 
53951 TCTTCTGTAA TCTCCCTTAA CCAATATATC CCTCAACATT CTCCTCACCC 
54001 CCAACTCCAC CCTCCCAGGA TAACCAGTTG GGACATAATC TTTATTTAAA 
54051 AATGGTTTCC GGATAGAGAA AGCGCTTCGG CGGCGGCAGC CCCGGCGGCG 
54101 GCCGCAGGGG ACAAAGGGCG GGCGGATCGG CGGGGAGGGG GCGGGGCGCG 
54151 ACCAGGCCAG GCCCGGGGGC TCCGCATGCT GCAGCTGCCT CTCGGGCGCC 
54201 CCCGCCGCCG CCCTCGCCGC GGAGCCGGCG AGCTAACCTG AGCCAGCCGG 
54251 CGGGCGTCAC GGAGGCGGCG GCACAAGGAG GGGCCCCACG CGCGCACGTG 
54301 GCCCCGGAGG CCGCCGTGGC GGACAGCGGC ACCGCGGGGG GCGCGGCGTT 
54351 GGCGGCCCCG GCCCCGGCCC CCAGGCCAGG CAGTGGCGGC CAAGGACCAC 
54401 GCATCTACTT TCAGAGCCCC CCCCGGGGCC GCAGGAGAGG GCCCGGGCTG 
54451 GGCGGATGAT GAGGGCCCAG TGAGGCGCCA AGGGAAGGTC ACCATCAAGT 
54501 ATGACCCCAA GGAGCTACGG AAGCACCTCA ACCTAGAGGA GTGGATCCTG 
54551 GAGCAGCTCA CGCGCCTCTA CGACTGCCAG GAAGAGGAGA TCTCAGAACT 
54601 AGAGATTGAC GTGGATGAGC TCCTGGACAT GGAGAGTGAC GATGCCTGGG 
54651 CTTCCAGGGT CAAGGAGCTG CTGGTTGACT GTTACAAACC CACAGAGGCC 
54701 TTCATCTCTG GCCTGCTGGA CAAGATCCGG GCCATGCAGA AGCTGAGCAC 
54751 ACCCCAGAAG AAGTGAGGGT CCCCGACCCA GGCGAACGGT GGCTCCCATA 
54801 GGACAATCGC TACCCCCCGA CCTCGTAGCA ACAGCAATAC CGGGGGACCC 
54851 TGCGGCCAGG CCTGGTTCCA TGAGCAGGGC TCCTCGTGCC CCTGGCCCAG
54901 GGGTCTCTTC CCCTGCCCCC TCAGTTTTCC ACTTTTGGAT TTTTTTATTG
54951 TTATTAAACT GATGGGACTT TGTGTTTTTA TATTGACTCT GCGGCACGGG
55001 CCCTTTMTA AAGCGAGGTA GGGTACGCCT TTGGTGCAGC TCAAAAAAAA
55051 AAAAAAAAAT GATTTCCAGC GGTCCACATT AGAGTTGAAA TTTTCTGGTG 
55101 GGAGAATCTA TACCTTGTTC CTTTATAGGC CAAGGACCGC AGTCCTTCAG 
55151 TAACACCAGT GTAAAAGCTT GAGGAGAAAT TGTGAAGCTA CACAGTATTT 
55201 GTTTTCTAAT ACCTCTTGTC ATTCTAAATA TCTTTAATTT ATTAAAAAAT 
55251 ATATATATAC AGTATTGAAT GCCTACTGTG TGCTAGGTAC AGTTCTAAAC 
55301 ACTTGGGTTA CAGCAGCGAA CAAAATAAAG GTGCTTACCC TCATAGAACA 
55351 TAGATTCTAG CATGGTATCT ACTGTATCAT ACAGTAGATA CAATAAGTAA 
55401 ACTATATTGA ATATTAGAAT GTGGCAGATG CTATGGAAAA AGAGTCAAGA 
55451 CAAGTAAAGA CGATTGTTCA GGGTACCAGT TGCAATTTTA AATATGGTCG 
55501 TCAGAGCAGG CCTCACTGAG GTGACATGAC ATTTAAGCAT AAACATGGAG 
55551 GAGGAGGAGT AAGCCTGAGC TGTCTTAGGC TTCCGGGGCA GCCAAGCCAT 
55601 TTCCGTGGCA CTAGGAGCCT GGTGTTTCCG ATTCCACCTT TGATAACTGC
55651 ATTTTCTCTA AGATATGGGA GGGAAGTTTT TCTCCTATTG TTTTTAAGTA 
55701 TTAACTCCAG CTAGTCCAGC CTTGTTATAG TGTTACCTAA TCTTTATAGC 
55751 AAATATATGA GGTACCGGTA ACATTATGCC CATTTCTCAC AGAGGCACTA 
55801 CTAGGTGAAG GAGTTTGCCT GACGTTATAC AACCAGGAAG TAGCTGAGCC 
55851 TAGATCCCTT CCACCCACCC CATGGCCCTG CTCATGTTCC ACCTGCCTCT 
55901 AATTTACCTC TTTTCCTTCT AGACCAGCAT TCTCGAAATT GGAGGACTCC 
55951 TTTGAGGCCC TCTCCCTGTA CCTGGGGGAG CTGGGCATCC CGCTGCCTGC 
56001 AGAGCTGGAG GAGTTGGACC ACACTGTGAG CATGCAGTAC GGCCTGACCC 
56051 GGGACTCACC TCCCTAGCCC TGGCCCAGCC CCCTGCAGGG GGGTGTTCTA 
56101 CAGCCAGCAT TGCCCCTCTG TGCCCCATTC CTGCTGTGAG CAGGGCCGTC 
56151 CGGGCTTCCT GTGGATTGGC GGAATGTTTA GAAGCAGAAC AAGCCATTCC 
56201 TATTACCTCC CCAGGAGGCA AGTGGGCGCA GCACCAGGGA AATGTATCTC 
56251 CACAGGTTCT GGGGCCTAGT TACTGTCTGT AAATCCAATA CTTGCCTGAA 

56301 AGCTGTGAAG AAGAAAAAAA CCCCTGGCCT TTGGGCCAGG AGGAATCTGT

56351 TACTCGAATC CACCCAGGAA CTCCCTGGCA GTGGATTGTG GGAGGCTCTT 
56401 GCTTACACTA ATCAGCGTGA CCTGGACCTG CTGGGCAGGA TCCCAGGGTG 

56451 AACCTGCCTG TGAACTCTGA AGTCACTAGT CCAGCTGGGT GCAGGAGGAC 
56501 TTCAAGTGTG TGGAGGAAAG AAAGACTGAT GGCTCAAAGG GTGTGAAAAA 
56551 GTCAGTGATG CTCCCCCTTT CTACTCCAGA TCCTGTCCTT CCTGGAGCAA 
56601 GGTTGAGGGA GTAGGTTTTG AAGAGTCCCT TAATATGTGG TGGAACAGGC 
56651 CAGGAGTTAG AGAAAGGGCT GGCTTCTGTT TACCTGCTCA CTGGCTCTAG 
56701 CCAGCCCAGG GACCACATCA ATGTGAGAGG AAGCCTCCAC CTCATGTTTT 
56751 CAAACTTAAT ACTGGAGACT GGCTGAGAAC TTACGGACAA CATCCTTTCT 
56801 GTCTGAAACA AACAGTCACA AGCACAGGAA GAGGCTGGGG GACTAGAAAG 
56851 AGGCCCTGCC CTCTAGAAAG CTCAGATCTT GGCTTCTGTT ACTCATACTC 
56901 GGGTGGGCTC CTTAGTCAGA TGCCTAAAAC ATTTTGCCTA AAGCTCGATG 
56951 GGTTCTGGAG GACAGTGTGG CTTGTCACAG GCCTAGAGTC TGAGGGAGGG 
57001 GAGTGGGAGT CTCAGCAATC TCTTGGTCTT GGCTTCATGG CAACCACTGC 
57051 TCACCCTTCA ACATGCCTGG TTTAGGCAGC AGCTTGGGCT GGGAAGAGGT 
57101 GGTGGCAGAG TCTCAAAGCT GAGATGCTGA GAGAGATAGC TCCCTGAGCT 
57151 GGGCCATCTG ACTTCTACCT CCCATGTTTG CTCTCCCAAC TCATTAGCTC 
57201 CTGGGCAGCA TCCTCCTGAG CCACATGTGC AGGTACTGGA AAACCTCCAT 
57251 CTTGGCTCCC AGAGCTCTAG GAACTCTTCA TCACAACTAG ATTTGCCTCT 
57301 TCTAAGTGTC TATGAGCTTG CACCATATTT AATAAATTGG GAATGGGTTT 
57351 GGGGTATTAA TGCAATGTGT GGTGGTTGTA TTGGAGCAGG GGGAATTGAT 
57401 AAAGGAGAGT GGTTGCTGTT AATATTATCT TATCTATTGG GTGGTATGTG 
57451 AAATATTGTA CATAGACCTG ATGAGTTGTG GGACCAGATG TCATCTCTGG 

FIG. 3-23 



U.S. Patent Jan. 22, 2002 Sheet 29 of 41 US 6,340,583 B1

57501 TCAGAGTTTA CTTGCTATAT AGACTGTACT TATGTGTGAA GTTTGCAAGC 

57551 TTGCTTTAGG GCTGAGCCCT GGACTCCCAG CAGCAGCACA GTTCAGCATT 

57601 GTGTGGCTGG TTGTTTCCTG GCTGTCCCCA GCAAGTGTAG GAGTGGTGGG 

57651 CCTGAACTGG GCCATTGATC AGACTAAATA AATTAAGCAG TTAACATAAC 

57701 TGGCAATATG GAGAGTGAAA ACATGATTGG CTCAGGGACA TAAATGTAGA 

57751 GGGTCTGCTA GCCACCHCT GGCCTAGCCC ACACAAACTC XCCATAGCAG 
57801 AGAGTTTTCA TGCACCCAAG TCTAAAACCC TCAAGCAGAC ACCCATCTGC 
57851 TCTAGAGAAT ATGTACATCC CACCTGAGGC AGCCCCTTCC TTGCAGCAGG 
57901 TGTGACTGAC TATGACCTTT TCCTGGCCTG GCTCTCACAT GCCAGCTGAG 
57951 TCATTCCTTA GGAGCCCTAC CCTTTCATCC TCTCTATATG AATACTTCCA 
58001 TAGCCTGGGT ATCCTGGCTT GCTTTCCTCA GTGCTGGGTG CCACCTTTGC 
58051 AATGGGAAGA AATGAATGCA AGTCACCCCA CCCCTTGTGT TTCCTTACAA
58101 GTGCTTGAGA GGAGAAGACC AGTTTCTTCT TGCTTCTGCA TGTGGGGGAT 
58151 GTCGTAGAAG AGTGACCATT GGGAAGGACA ATGCTATCTG GTTAGTGGGG 
58201 CCTTGGGCAC AATATAAATC TGTAAACCCA AAGGTGTTTT CTCCCAGGCA 
58251 CTCTCAAAGC TTGAAGAATC CAACTTAAGG ACAGAATATG GTTCCCGAAA 
58301 AAAACTGATG ATCTGGAGTA CGCATTGCTG GCAGAACCAC AGAGCAATGG 
58351 CTGGGCATGG GCAGAGGTGA TCTGGGTGTT CCTGAGGCTG ATAACCTGTG 
58401 GCTGAAATCC CTTGCTAAAA GTCCAGGAGA CACTCCTGTT GGTATCTTTT 
58451 CTTCTGGAGT CATAGTAGTC ACCTTGCAGG GAACTTCCTC AGCCCAGGGC 
58501 TGCTGCAGGC AGCCCAGTGA CCCTTCCTCC TCTGCAGTTA TTCCCCCTTT 
58551 GGCTGCTGCA GCACCACCCC CGTCACCCAC CACCCAACCC CTGCCGCACT 
58601 CCAGCCTTTA ACAAGGGCTG TCTAGATATT CATTTTAACT ACCTCCACCT 
58651 TGGAAACAAT TGCTGAAGGG GAGAGGATTT GCAATGACCA ACCACCTTGT 
58701 TGGGACGCCT GCACACCTGT CTTTCCTGCT TCAACCTGAA AGATTCCTGA 
58751 TGATGATAAT CTGGACACAG AAGCCGGGCA CGGTGGCTCT AGCCTGTAAT 
58801 CTCAGCACTT TGGGAGGCCT CAGCAGGTGG ATCACCTGAG ATCAAGAGTT 
58851 TGAGAACAGC CTGACCAACA TGGTGAAACC CCGTCTCTAC TAAAAATACA 
58901 AAAATTAGCC AGGTGTGGTG GCACATACCT GTAATCCCAG CTACTCTGGA 
58951 GGCTGAGGCA GGAGAATCGC TTGAACCCAC AAGGCAGAGG TTGCAGTGAG 
59001 GCGAGATCAT GCCATTGCAC TCCAGCCTGT GCAACAAGAG CCAAACTCCA 
59051 TCTCAAAAAA AAAAA (SEQ ID NO: 3) 
CHROMOSOME MAP POSITION:

Chromosome 22 
DNA

Position

941 GAGTMGTGGGTGGTCAGGnACAGACTTMTnTGGGTTAAAAAGTAAAAACAAGAAAC 
AAGGTGTGGCTCTAAAATAATGAGATGTGCTGGGGGTGGGGCATGGCAGCTCATAAACTG 
. :' ■ ACCCTGAMGCTCnACATGTMGAGnCCAAAMTATTTCCAAAACTTGGAAGATTCAT 
nGGATGTTTGTGTTCATTAAMTCTCtCACTMnCATTGTCtTGTCCACTGTCCGTAA 
CCCMCCTGGGAnGGTTTGAGTGAGTCTCTCAGACTTTCTGCCTTGGAGTTTGTGAGAG 
[A,T]

GATGGCATACTCTGTGACCACTGTCACCCTAAAACCAAAAAGGCCCCTCTTGACAAGGAG 
TCTGAGGATTTTAGACCCAGGAAGAATGAGTGATGGGCATATATATATCCTATTACTGAG 
GCATGAGMGAGTGGMTGGGTGGG1TGAGGTGGTGTTTTAAGGCCTCTTGCCAGCTTGT 
TTAACTCTTCTCTGGGGAACGAGGGGGACAACTGTGTACATTGGCTGCTCCAGAATGATG 
TTGAGCAATCTTGAAGTGCCAGGAGCTGTGCTTTGTCTATTCATGGCCCCTGTGCCTGTG 

2612 TGAGTTGGAACAGTTTGATACCAAAACCATCCCCCCGCCCCCCAACCCCCAGCCTAGGGT 

CCGTGGAAAAATTGGCCCCTGGTGCCAAAAAGGTTGAGGACTGCTGATCTAGAGGACCAA 
TnAnCMTGnGGnGAGTAMTGAGCTCTTGGATTAGGTGATGGAAAAATCTGAAAA 
MCAGGGCTTTTGAGGMTAGGAAAAGGCAGTMCATGTTTAACCCAGAGAGAAGTTTCT 
GGCTGTTGGCTGGGAATAGTCATAGGAAGGGCTGACACTGAAAAGAAGGAGATTGTGTTC 
[G,A]

TTTCncnCTCAGAGCTATMGCAAAGGCTCAAAGnCTAGAAAMGGCMGTTrrGTT 
TCAGTAGAAAAMGGATMTCAGMCCATTTTTAGAAMTGGAATGAGACTACTTTTGAG 
GCCATGAGTTCCTTGTCCCTGGAGAGATGAGCAGAGGTTGGACAAGTGCTTACCAGAGAT 
CTTGTGGAGGCAGAMCTGTGCATCTAGCAGAGCATTGGCCTAACCCTTTCAAATGAGAT 

6CTGTTMCTCAGTCTTATTCTACATGGTAG6AATCCTGTCCGTTTGCCTCCT6CTACTT 

' 5080 ACMCGTAA^TAGnGAMTnGTTGGTGGAMGAAGAGCAGTCCACTCCAGAGGCTGG 

ATGGGCATGCCTGGCCCCCAAGGTCTGAAGTGGTAGGGCTGTGCCTATATCCTGAGAATG 
AGATAGACTAGGCAGGCACCTTGTGCTGTAGATTCCAGCTCCTGCACATAGCTCTTGTTG 
TAAMCATCCCTGTGCnATACCMGTMTTGAGTTGACCTTTAAACACTTGCCTCTTCC 
CTGGGAACCATATAGGGGATTGGCCTGGAGACGTCTGGCCTCTGGAAGAGTTGGAAAGCA 
[G,A]

CCATCATTAnATCCTTTCCTTTCAGCTATAACTCAGAGCTCTCAAGTCTTTTCTGTGGA 
TCTTAnGCCnGGnCTTGCCCCTTTTACTCCCAGGGMGnGATTCTGTCTTTTCTGT . 
TCCATTTAGTATGACAGGAGCAGAGiMTGTCAGAGCTGTAAGGGACCTTATAGTTAAAGC 
CTTTGGCTGGTCCTnCATTTTATAGCTGGGACTMTMGTMCGTCAAAACCCAATGAG 
nCACAGAnGGGTCTCGCCnGGCATGTMCCCATATGnCATATTCTTGCTGTTTTCC 

6599 CTGTMTCCTAGCACTCT^GAGGCCGAGKAGMGGATCKTTGAGCCCAT^ 

GAGTTTGAGACCAGCCTGGCCMCATGGCAAMCTCCACCTCTACAAAAAATACAAAAAT 

attagccaggcgtgatggcacacacctgtagtcccagctacttgggaagctgaggagcga 
tgattacctgagcccagggatatcaaggctgtAgtgagctgtgatcatgccactgtactc 

CATCCAGCTGGGGGACAGAGTGAAACCCCTGTCTCAAMCAAMCAAATGAAAAAAAAAA 
[-,A,C]

CCTTAATAATCAGTAACTGTCACTTTATATTATGTTGTGAGTGTGTGTCTATATACACCT 
ATATGTATACATTTCTCTTATTACACATTCATTGGTGATCTGATGTGGAGCCCCAGGGAT 
TMGGGCMCTTTGAACTACCCTGACACAATCAAGCCAAATATCATTCCCGTGGAGGAAG 
TAGAGTATCTAGGTTCTGTCTCCTAGlTGCAGCmACCTTGAGGACAGAGACTCTAATC 
CAGCTGTGCTGMGGAGCACATCTCCTGACnCTGAGCTTTCCCCTGGTAAATTCAAACT 
6983 CACATTCATTGGTGATCTGATGTGGAGCCCCAGGGATTAAGGGCAACTTTGAACTACCCT
GACACAATCAAGCCAAATATCATTCCCGTGGAGGAAGTAGAGTATCTAGGTTCTGTCTCC 
TAGTTGCAGCTTTACCtTGAGGACAGAGACTCTAATGCAGCTGfGCTGAAGGAGCACATC 
TCCTGACTTCTGAGCTnCCCCTGGTAMnCAMCTGGATGTCACGGCGCCCTCAGATA 
GAGCCTGGTMTTTGCCCTGGGGAGAGTGACTGTCTTTTGGATCTMTTTi^CTTTTGCC 
[C,G]

CAGTTGGAGGAAMTCTTCAGGGCTAGGAAGGATTGTATTTGTCTGACCCCAGAGATAAC 
CTGGGTTTTGAGGAACATGGGGCATCAACCTGAATGGTCTTGTAAGATCTCTCCCACGCC 
AGCTTGCCAGTGTTTCTCTGATGAATTTAGAGTACCTGAGTAGTGCAGGCCTGCTGGGAG 
GAGGACTCTCCCTCTGTGCTACTCAGAGAAATTCATTC7TCAAGGCCCCCTTCCAGCCTT 
GCTCTTACCCAGCTGGGCTACAGTTACMTAMGGAAATGACTTTTCTTCTCCCCTTCCC 

GGCGTGCCACCACACCTTGCCA 1 1 II 1 1 1 1 1 ATTTTAAGTAGAAACAAGGTCTTATTAAT 
ACTATGTTGCCCAGGCTGGTCTTGAACTCCAGCGATCCTCCTGCCCCAGCCTCCCAAAGT 
GCTTGGGATTACGGAAGTAAGCCACTGTGCCTGGCCAGTGCAACCCCCATTTTATACTAA 
MCAGGMGGCCCAGAMGGTTTGGAGTAACTTGTCCAGGGTCACACAGATGATATTTGA 
ACTCAGGTCTCCCTGGCTCCCAAGAGAGTCTGCTTTCCACTAGGACTCCCAGGAGAAAAA 

[A,-]

AAAAAAAAAMCAGTAGACTTGGAGACAGAAMTCTGATTTGAGTCTTAGTTGAGCTAGG 
CTAACTGTGTAACTGTGGGCAAGTTCCTTAGCCCCTGTGAGCCTCAGTTTCTTATCTGTA 
AAATGTCATAAAAGAAATCCATCTCATGGAGTAGTTGTGATGATCAAGGACTCTGAAAAC 
ATTAGMTGGTTTAATGTGAAGGATTAGCAGCAGCACATGGCAACATTGTGCATCTTATA 
TTMCTATCCAMTATATCMGCGTCATTTGCTATATATAAAAGTCATCAAATTAGGCAC 

ACnGGGAGGCTGAGGCAGGAGAATCACTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCC ; 
CAGATCACGCCACTGCACTCCAGCCTGGTGACAGAGTAAGACTCCATCTCAAAAAAAAAA 
AAAAAAAAAAAAATTCCTTAATTTGGCCTACAGTAGAGCCCTCCGTAATGTGGCCTCTCT 
CCACATCTCCACAACCTCCTGCTCCCTGCACTTCAGCCTCACCTCTCTTCTGGACAGGCC 
CTCCTTCTGACAAGGGCTTTGTTCATTCTGCTCCCTCTGCCTAGAATGCCCCCTTACTCT 

[G,T]

TTCACTTMCTCCTGCnATCGmAGATCTTTACCTGGATG GCTCA GAGAAATATAGAA 
GTMnCCTCACCCTGAAAMTAGGTTAGGTCCCTGTTTTATGTTnCATAGACCTTTCC 
TTTGAGGCTTTTTTTAAAAMGTAGTTnMTCTCACATTTATTCATGTG ATCATC TCCT 
TMTGATATC1TMGACCTCTMTAGMCAAT1TGGTCATGGACTGTGGGGTTTTTGCCC 
CTCATTGTGTCAGCACTGAGCATATTGTTGGCATAGGAGGGATATTTGTTGAATGAATTG 

17707 GTAGTGGGTGCTCAGAGTGTTTGCTGGGTGMTGATGTATTTGTTGAACGACTCTTTGGA 
CACTTGMTAAAGTCCATCCAGTATGCACCATTACCATCTCTTCGCTCTACAATATTCTT 
TTAGGCAAGAGCnATCTITTGAGGTGATAAGATAAGCTCAAACTTATGTAGACTAAGAC . 
CTCAGTCTGTAMTGTCATCCCTMGTCTTAAACCAtCAAAACCAGG GCCTCAAGGAATG 
GCATGKXTTCTGCMCTGTAGCMCCTGC^^ 
[T,C]

CCCAAAAGCTAGAGTCCCnCTCCCATGGGCAGTGCTGGMGTGTGCTMCAAATTCTTT 
CTCCATACTGCnACGATTACAAAAAAAACCCTCAGCATCTCATGCCAGACTTGAGTTAA 
GGnGTTTTCTTTTGTGTGTCAGCTGTATTCTGGTCATGACTTCCTGATGATGCCCTATA 
GAGATTTTGCTGAGATCAGAGGGTGCTCCACTGCCATCAGTAGCACTGACTCTTGCAGAA 
GCACCGTnCTGMGTTGGCTMTGTCATCCCTCACGTnGTTTGTnGAAATTTGTnT 
18219 TGCCATCAGTAGCACTGACTCTTGCAGAAGCACCGTTTCTGAAGTTGGCTAATGTCATCC

CTCACGmGmGTTTGAMmGTTnAGnCCAGAGATAGCACTTTCATGGAATGAC 
GCTATCTTCTAGAATCAC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 GAGTTGGAGTCTCGCTGTGTCGCCAGG 
CTG^GTGCAGTGGCACMTCTCAGCTCACTGCAATCTCCACCTrCCGGGTTCAAGTGAT 
TCCCCTGCCTCAGCCTCCCGAGGAGCTG7TACTACAGGCGCACACCCCCACTCCTGGCTA 

[-,A]

TTTTATGTGTTTTAGTAGAGACGGGGTTTCACCGTGTTGGCCAGGATGGTCTCGATCTCC 
TGACmGTGATCTGCCTGCTTCAGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGTCAC 
CGCGCCTGGCCTAGMTCACCTTTTTATACCATAACGTGAGCACCACTGCCGCGTCACCA 
AGGAAAGAGAGAGGCAGCTACTGTGGGGTTACAAATGGGTAAGAGTGGCACCAGGAAGGT 
GAMGTCTCTACTTAGCCMGGCnMCAAMTGTCMTCACCAMCATTTATTTATTAA 

19670 GACCCCCATGATGAGCMCTATAGCACTAGMCAGTGATAATMCTAATGTTTATAATGC 

ATCTTCAGTTTACAGAGGGCTTTTGTACTCATCATCTAGlTrAGTTCCTGCAACAACCTC 
nGAGGMTATAGCACMGCAGGACMGGGMGCCCAGAGATGTTAMTMTTTATCCAA 
GTTTATGCTGCTGGGMGGGCAGCACTGAMTTAAMGAAMGTTTTCTGAGCTCAAATC 
CCATGCCCTTTCCTCAATGTGAGCTCTAGCAAGGTATTCAGGAATCCTGCCTCTACAGTT 
[C,T]

AGAGCCTCAMnGCTGGGTATGnGAGncnGTATCTGATTTnCTAGATTTCCTGCC 
CACATTCTTACTGTCTGGATATCAGGAAAGAGT7TATCAAATGCCTGTGGAAATCCAAGA 
TAAGGTCTCATGATGAGTAACCCAGTGAAAACATGAAGTCAAGTCTAACTAGTCACTACT 
ATnCACTACTGCTGACTCCTGATGATCAGCTCCTTTTCTMGTGCTTACTGTCCACTTA 
nCCATCATCrGCCTAGMmATGTGMGGAATCAAAGCAAAAGGATCATAAGGCnCC 

-21153 GGACCCnGTTTTAGMGGATGACTGCTGCTATMTGTAGAMGTGATTTG^^ 

AGGAGTGGGGCACGAMGATGGTTAGTAGATGGGGGTGGTMTGCnACCTTTCAGTATT 
TGGAGGGTCGGAGTCCTCAAAMnCTCTTCCTTGATTGGAGTCCTCCCAGCCAATAGA 
GGGCnCACACAMCAGTnCTTGGGTmGMnGTTTGACCAGAGCmcnCCGACA 
AMGGnGGGGTGATTCAnCACTTACCACACCnGCCTGMCATTCACTTGGGGCTGCC 

[G,T]

GTTATGMGGCTATTGTTCTCCAGCCTGTCACAGACGCTTTGAAGACCTGTGCCTCAGCT 
GGTTCTMGGAGTCAGTnGnCAGCTCCGTGCCAGGTTTCCAACTTATGAAATGTGCTG 
GAGATTMCACCTCTCCTGCCATTTTATCCCTACTATMTTGCCAGTCAAAGGATTCCTG 
CAGTTGCGTCTGGCAGCCATAACTGATGAATGTTCTGCCAGCTGCTCTGAGGACCTAGAA 
GAGCAGTTTTCTATCCAGGACCAGTTTCCAAGGGTGGGAGGGTGAAATATATCCTCCAGT 

24566 CTACTCTGGAGGCTGAGGTGAGAGGATCACTTGAGTCCAGMGGTCGAGGTCMGATTGT 

AGTGAGCCATGATGGCATCACCGCACTCCAGCCTGAGTGACAGAGAGAGACCCTGACTCA 
AAAAAAAAAAMCAVWWW^AAMCACCCTCACCACTTATCAGCTAtTTGTCTTGAGAA 
TAGTGACATMCCCCTCAGMCCTATnCCTMTCTGTTAMTGAGGCTGATGACGTTTC 
CTCCTTnACTGGCMTnAMCATGATGGATMTAMTGCTMGCACTTAACACAGGGC 

[C,-]

TAGMGATAnMCTGCTCMTAMTGGTAGCTTCTTAACAGTATTCAAACCCATGTGCT 
CnATCACATGCAnGTTGTCCCTGTGTCCAGTTGGTGGAATGGGAAAAGGCTCCCTTGT 
MCCCCATCTACCATCmATCAGACmCCTGCCATGGTTCACAGTAAGAGATAGAAGC 
TGCACGIGTGACnCTGGCTCTrTACAATGGTGAGCGGTGTGTGCCTGGTAAGGGAGAGCT 
GATGTCACTGCCCCAMTCCAGTAGTGAGATCTGAGTGnCTGGTTTCCTCCAGCAGCCT 
26604 GATTTGCAGCTGAGCCTGTCTATCTGGTGTGGGAAGAAGATGGGGAGTTACTTGTCAGTC

CCGGCTTACTTCACCTCCAGAGACCTGTTTCGGTGAGTTGGTCTCCGAGTTCCCCTCTCC 
ATCTCTCCTGGCCCCTGGTCCTGAGAGGAGGGTGGTCTCCCTAAATCTCCTTCTCACTTA 
GTCCTrrACCATCGGTTCTGCCGGGCAGAAGCCAGCGGAGGTTATACCCAAGGAGAATCG 
GCCTTGTGAGGTACCCeCATTATGTCCTGGAAGTGGTGAGGGGAGGGATATACCCAGAAG 

[G,A]

MCTTCTTAGGGAGCTCCAGCTCCCCTTCTATCCCAGACAAACCTGAAGGAGCCTCCAAA 
AGATGCCACTGACCTGCCCATTGTAGATGTTACTGCTTCCGGGGGGAATAGCCCAAATAG 
AGTGCTGTTTCCAGCTCTCACATGTCTTACCTGCGGGCCATGCTGCCTGCCCAGGAATTT 
GTCCCAACAAGCAGGATGGGCAGGTTTTGCCAAACTGTGGAAACTGGCAAGTCCTGGGTG 
TGGGTAGCCTGGTACACAGTAGGCACCTTATAAACGTTTGTTCTCTTAATGGCAGGCACA 

27255 TGGGGAAAGACGTGGGCGAGTGCTTCTAAGACTGGAGCAATGGGCTTTAGAGTGTTCCTG 

AGCTGCTGGGCCAGCCeCCACACCTCCTCAGTCCCTAGGCCTAAGTACCTCCACGAGCCT 
CTCTCTGTGGGGCTTCTCAGAGGGAGATGTGGAMGTCTACCTCTMCCTGGCTTTC'nT 
GCTCATTGCCCCACTCCACCTCCCATAGAAACTCCCCAGGGGGTTTCTGGCCCTCTGGGT 
CCCnCTGMTGGAGCCAnCCAGGCTAGGGTGGGGTnGTTTTCAnCTTTGGGAGCAG 

[C,G]

CTGnGnCCAAAAAGGCTGCCTCCCCCTCACCAGTGGTCCTGGTCGACTTTTCCCTTCT 
GGCTTCTCTAAGCTAGGTCCAGTGCCCAGATCTTGCTGCCGGGATACTAGTCAGGTGGCC 
AGGCCCTGGGCAGAAAAGCAGTGTACCATGTGGTTTTGTGGAATGACCGGACCCTGGTAG . 
ATTGCTGGGAAGTGTCTGGACAGGGGGAAGGGGGAAGGGAACTGGTCCTCAATGCTGACT 
CTACCMGCGCCCTGCTAGACACTTTATCCT1TMTCTCTCAACAGCCTAAAGAGATTAT 

27399 AGATGTGGAAACTCTACCTCTMCCTGGCTTTCTTTGCTCAtTGCCCCACTCCACCTCCC 

ATAGT^MCTCCCCAGGGGGTTTCTGGCCCTCTGGGTCCCnCTG^ATGGAGCCATTCCAG 

GCTAGGGTGGGGTTTGTTTTCATTCTnGGGAGCAGCCTGnGnCCAAAAAGGCTGCCT 
CCCCCTCACCAGTGGTCCTGGTCGACTTTTCCCTTCTGGCTTCTCTAAGCTAGGTCCAGT 
GCCCAGATCTTGCTGCCGGGATACTAGTCAGGTGGCCAGGCCCTGGGCAGAAAAGCAGTG 

[T,C]

ACCATGTGGTTTTGTGGAATGACCGGACCCTGGTAGATTGCTGGGAAGTGTCTGGACAGG 
GGGAAGGGGGAAGGGAACTGGTCCTCAATGCTGACTCTACCAAGCGCCCTGCTAGACACT 
TTATCCTTTMTCTCTCMCAGCCTAAAGAGAnATATATCCCC ATTTTA CAGATGAGGC . 
MCCAGTTTCMCAGAGTTMCATATGGAGCCTCACTGGGCAGCTTTTTCTGTCTTCCTG 
ACTTTCTCTCATCCn"CAGGGGGCTGCAGGTnGTnTCnCTCCTAGTGGAGAGGAAAT 

28088 MGAGCCMTGGAMTTGATCnGAGTTTAGGAGAM 

GCCAAGTGTTGMGTAGCCACATnCAGGTCCTCAnMTTTCTCTTAATCCTGGGAAGG 

CAGCTTAGGAGMGGGTTGnCCTKAGGAGCCAGGMCTATACCCCTTTTA 

GAGGCAGGGMGCCAGGGAGGACACAACTTCTCAGGAAGAGGAGAAGCTAGAGCAGATAG 

TGAACTCTCAACCTGAACCTTTAAGGGCCAGACCAGTAATGCCACCCAAGTCCACCTGCC 

[G,A]

THGTCTTGTTCTGTCCCAGGCTTTCTGGAGAACCTGATCTTCTTGCCCCTACCCCCAAG 
CTCCGTTTGCCCAGCTAGAGTCTGGGGGGTACTGACTGACTTTCGTAGACATTCTTCCCT 
TCCCCAAATAAGAGGCCACATTCCTGAAGTCACTTCTGAAGAGATAGCTGCCACACAGGG 
CTCTTTCCCCCCAGGGAGGGACCACCCAGACCCTCTGCTCTCCCAGGTATCCGTTACCAC 
ATCACTACCTGGTCAGAAAGCTGTnCTGCCATTAGCCCCTCCCTCTTTTATTATAGGAT 
28734 AAGTAGAAGCTAGACTTCTTGGGCTCCTGAACAGGGTCCTTGCTGGATTCTGTGAAACAA

ATTAAGTTCTTGACCCTA6GCCTCTGGG6GAGTACAAAGTCTAT6GGAG7TCTGGGGCT6 
TGGTTGCAAGGAAAGTGACGCAACCAGATTCCATGGGGACATGATCAGGCGTGACATGTG 
AGGGAGGAAGAGGGAGCAAGGGAATGAAGAATACAACTTCTGTGTCCCATACACCCCTGC 
CTGACAGGCCATACATACTCAGCAGAGAATGCACTGTCTTTCCTACCACACTAGCGtGAG 
[G,A]

AGTGAGCTGCMTTACCACTGTGCTTCCMGTAAGAAAATACCTCAAATTGGAATTTACA 

AMGAGGTAMTTAGGGAGTGGCTWGTCGGACATCTTTA^ 

GMmCACTTMTGTCCMTACTGAmMTGAGCTTGGGTTTACACATTATCTCTTGA 

AGAAMCAMTGMCCTTrGTGnCCAMGCMTCCATGTTTAAAGGGAAAAAATTATGC 

ATAACTCTGCCCAGCTTCACAGTAACCTTTGGCAGGTGCCTTAGGTCCTCTGGGACTCTT 

29246 MTCCATGTTTAMGGGAAAAAATTATGCATAACTCTGCCCAGCTTCACAGTAACCTTTG 

GCAGGTGCCTrAGGTCCTCTGGGACTCTTTTCCTTATCTGAAAAATGAAGGACTTGGATC 
AGGTGMTGGTTCCCAGCTCTGCAACTTATGTGGCTCCTCAGAGGCACACAAGCTCTTTT 
CCATTATTTGCCAAATAATGGAGGCCCTGTCTnAACTGCAGTACAACTACACAAAATAC 
nGAMCTACAGTCnCCTGGTTTTTGGTTGGAACTGAATCAGTGCACTCTAGCAACACT 

[-,T]

ATTTCTTGCTGTTCGTAGGCTTCATTATGTGTnGGTTMTTnTTAAMCAACMTMC 

ATATTCCATMTMnAC AGCTTM TTGGCAGACTGTTTCAGTCTATAGGATCTGCAGGA 
AGGAGGAGTMTAMGGGATrrTTGACTGAGCTCTTATGGAACAGAGTCTCTCTAGGCCC 
CTGTCATATCTGCCCTTCTGGGCCCTGGGGAAAAGTTGGCATCCCCAGTTGTGGTGCTCT 
CCAGGTGCCCTCAGGCTGTGGTGGAGGGAGCTTCCCATTCTCTCCTTCAGGCCACTCAAT 

29490 MCTACAGTCTTCCTGGT7TTTGGTTGGMCTGMTCAGTGCACTCTAGCMCACTTATT 

TCnGCTGnCGTAGGCTrCAnATGTGTTTGGTTMTTTTTTAAAACAACAATAACATA 
TTCCATMTMTTACAGCTTAATTGGCAGACTGTTTCAGTCTATAGGATCTGCAGGAAGG 
AGGAGTMTAMGGGATTTTTGACTGAGCTCTTATGGAACAGAGTCTCTCTAGGCCCCTG 
TCATATCTGCCCTTCTGGGCCCTGGGGAAAAGTTGGCATCCCCAGTTGTGGTGCTCTCCA 

[G,A]

GTGCCCTCAGGCTGTGGTGGAGGGAGCTTCCCATTCTCTCCTTCAGCCCACTCAATTCAG 
AGGCTAGGGGCTGAAAGAAGCTTCTCTACAACTGGCTGTTCACTGGGAGGTTAAGGGATG 
ACCATCCAGCCAGGCCTTCCTCAGGACATGGGAGGGCTTATGCTTTAACATGTGTAAATC 
CACTGCMTMTGACTGGTTCTTnACCCCATMGGTTGAGMTTTACCTGTAAACATTT 
nGTCTGMGMTTTGGATGTAAGTGAGGGCTGGGCCTCTATCTTATCTCACTTGGCTTC 

29934 GGACATGGGAGGGCnATGCTTTAACATGTGTA AATCCA CTGCAATAATGACTGGTTCTT 

TTACCCCATMGGnGAGMTTTACCTGTAMC^^ 

GTGAGGGCTGGGCCTCTATCTTATCTCACTTGGCTTCTGTCAGCACAGCACCTTGCCTGC 

TTGncnACACATCCTAGATGCACAGTMCTATTtCCTMTTATTAGMATCTAnAGA 
ATCMTTGATTTCAGCTGGGCTTGGTGGCTCCTTCCTGTAATCCCAGCACTTTGGGAGGC 

[T,C]

AAGGCTGGAGGATCACCTGAGTCCAGGAGTTTAAGACCAGCCTGGGCAACATAGGGAGAC 
CCTGTCTCTACAAAAAATAAAAAATTAGCCAGGCATGGTGGTGTGCACCTGTAGTCCCAG 
CTACTCAGGAGGCTGAGGCAGGAGGATCTCTTGAGCCTGGGAGGTCAGACTACAGTGAGC 
AATGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGTAAGACTCTGTCTCTTAAAAAA 
AAAAAAAAAAMGnGATnCTATTrGGATAGATAMTMTTCATTTTAGGACCT^ 
34480 CTGACTTCAAGTGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGATTATAAGCATAAGC

CACTGTGCCCAGCTGCTCTCTATATTTTTMTACATATTATTTCCAnMTTTTCACAGC 
AGnCATTTTATAGATGAGGAAACTAGGCCAGAGAAGTAAAATATCTTGCCCAAGATGAT 
GTMCTAGTMGTGGCAGGATCMGATTCAAACCAAGCAATGTTCAAACCTCTTGGAAGC 
MGMTGTGGCCACTGTGGAAGGTGCAAGGCCTTGACAACAAGAATAGGGAAAAGAAGGA 

[A,G]

CTAGAAGGAAAGAGATGGCATGGGCTCAGCAGGCCAGGGAGCTCTTAGCTGTGTGTGTTG 
GGAAGCTCAGAAGGGAGGAAGAGGTTGTCTGTGCAGGTMGTCCTGAGAACAC ACCAGAC 
TTTTGAGAGGTGGAGCTTCATAGCCAGGTCATTAGGGGAGAAGGGAGCTATAGA 1 1 1 1 1 1 

1 1 1 1 1 1 1 1 1| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 A GAGACGGGGTCTTACTATGTTGCCCAGGCTG 
GTCTTGAACTCCTGGGCTCAAGTGATCCTCCCACCTCAGCCTCCCAAAGTGCTGGGATTA 

38812 AMTCCAGCAGATCCATTGAGAGTITTMGCAGCMGGTGnGTGACCMGTTMCAlTn 

AGAAGGATCACTGGTATGGAGGTTGGATTGGAGAGGGGAAAGCCTAAAGGTATAGAGACT 
AGTTAGGAAGCTATTGTAGGCTGGGCATGGTGGnCATGCCTGTAATCTCAGCACTrTGG 
GAGGCTGAGGTGGGAGGATTGCTTGAGGCCAGGAGTTGAAGACCAACCTGGCCAACATAG 
CMGACCCCGTCTCTGTTTTTCnMTTAAAAGAAAAGTCCAGACGTAGACATAGTGGCT 

[T,C]

ACGCCTGTMTGCCAGCACTTTGGGAGGCCAAGGTGGGCAGATTGCTTGAGGTCAAGAGT 
TTGGGATTAGGCCAGGCGCAGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG 
GTGGGCGGATCACAAGGTCAGGAGATCAAGACCATCCTGGCTAACACAATGAAACCCCGT 
CTCTACTAAAAGTACAAAAATTAGCCGGGCATGGTGGCGGACGCCTGTAGTCCCAGCTAC 
TCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTAGGAGGCGGAGCTTGCTGTGAGCAGA 

4073 1 GTT CTGTCCTAT(nCTGTC^^ : 

TAGGAAAGGAACCAGCTGGCCAGGGACAGACTATGAGGATTGTGCTGACCCAGCTGCCCC 

TGTGGGGATCACAGTTTACAGCCAGAGCCTGTGCGGACCCAGCTGTCTGCCAGGTTTCCT 

TAGAAACCTGAGAGTCAGTCTCTGTCCACTGAACTCCTAAGCTGGACAGGAGGGAGTGAT 

GCTAAACCCTGAAGGGCAACATGGCCTATGGAGAAAGCATGGAGCTCAGAGCCTGGAGTA 

[C,G]

GGGCACAGATAGGATTGAATAAATTGTGTAGAMGACTnGAAMCMTAAAGCAAAAGA 
TGMTGMCGTTTTTTnAGACTTGAGGGACCAACAACCCCCAAACCCCAGATTCTGCCA 
GGTCCATGGGGAAGGAGAAGTTGCCTTGAGTGGAAGCCCCAAGTAGGGAGACTTACAGAA 
AAGAAGTCAAGAGCACTGGCTCCCAGGCAGAAATACTGATACCCTACTGGGGCTTCAGGC 
TGAGCTCCTCCCTTCACAAATCACTTCATCTCTCTGAGCCTGTTTCTGCATCTGTGACAT 

41303 CTCTGAGCCTGTTTCTGCATCTGTGACATMGATGGTMGATAAAGGTGGCTGTCTCACC 

MTTATGTMGGAnAAATGTGGAAAAGGACATAAAGTTGTATAGTGCTGCCATAGGGAC 
AGTGnCAGTAMCGTGACACATTCTTAGTATCACTAAGAATCAGGTTCTrGGCCAGGCA 
CCGTGGCTCATGCGTGTMTCCGMCACTCTGGGAGGCCTAGGTCGGAGGATGGCTTGAA 
CACAGGAGTTTGAGACCAGCCTGAGCMCATAGTGAGACACTGTCTCTACAAAAAAAAAA 

[T,A]

MtMTMTMTrGTTTTTAATTAGATGGGCAGGGCACTGTGGCTCACACCTGTAATCCC 
AGCAC1TTGGGAGGCCMGGCCGGAGGATTGCTTGAGGCCAGGAGTTCAGGAGCAGCCTG 
GGCCACAnCCTGTCTCTACAMGMTAAAAMGTTMCTGGKaCATGGTGGCACATGCCT 
GTMTCCCAGCTACTCAAGAGGCTGAGGAGGAGGATTGCCTGAGCCCAGGAGTTCAAGAC 
TGCAGTGAGCCTTGATCACACCACTGTACTACAGCTTGGGCAACAGAGTGAGACC7TGTC 
41305 CTGAGCCTGTTTCTGCATCTGTGACATAAGATGGTAAGATAAAGGTGGCTGTCTCACCAA

TTATGTAAGGATTAAATGTGGAAAAGGACATAAAGTTGTATAGTGCTGCCATAGGGACAG 
TGTTCAGTAAACGTGACACATTCTTAGTATCACTAAGAATCAGGTTCTTGGCCAGGCACC 
GTGGCTCATGCCTGTAATCCCAACACTCTGGGAGGCCTAGGTCGGAGGATGGCTTGAACA 
CAGGAGTTTGAGACCAGCCTGAGCAACATAGTGAGACACTGTCTCTACAAAAAAAAAATA 
[-,A]

TMTMTMnGTTTTTAATTAGATGGGCAGGGCACTGTGGCTCACACCTGTAATCCCAG 
CACTTTGGGAGGCCAAGGCCGGAGGATTGCTTGAGGCCAGGAGTTCAGGAGCAGCCTGGG 
CCACATTCCTGTCTCTACAAAGAATAAAAAAGTTAACTGGGCATGGTGGCACATGCCTGT 
AATCCCAGCTACTCAAGAGGCTGAGGAGGAGGATTGCCTGAGCCCAGGAGTTCAAGACTG 
CAGTGAGCCTTGATCACACCACTGTACTACAGCTTGGGCAACAGAGTGAGACCTTGTCTC 

CTAAGAATCAGGTTCTTGGCCAGGCACCGTGGCTCATGCCTGTAATCCCMCACTCTGGG 
AGGCCTAGGTCGGAGGATGGCTTGAACACAGGAGTTTGAGACCAGCCTGAGCAACATAGT 
GAGACACTGTCTCTACAA/WAAAAMTMTMTMTMTTGTTTTTMTTAGATGGGCAG 
GGCACTGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGGCCGGAGGATTGCT 
TGAGGCCAGGAGTTCAGGAGCAGCCTGGGCCACATTCCTGTCTCTACAAAGAATAAAAAA 
[G,C]

TTAACTGGGCATGGTGGCACATGCCTGTAATCCCAGC'TACTCAAGAGGCTGAGGAGGAGG 
ATTGCCTGAGCCCAGGAGTTCAAGACTGCAGTGAGCCTTGATCACACCACTGTACTACAG 
CTTGGGCAACAGAGTGAGACCTTGTCTCCAAAAAAAAAAG 1 1 1 til 1 1 1 1 1 1 1 IATCCACT 
CTCCTCACCAAACAAACTGAGTAAG7TAGAGCCCTCTCAGCTGGCATGTGTTGGAAACAG 
TGCCCTCTCATTAAAGTGCTGCCCTCACTCCCATTGCCTCTTGGCCTTGGTCAGTATGAT 

AGCTACnGGWGGCTGAGGCAGGAGMTCGCnGMCCTGGMGGCGGAGGTCGCAGTG 

AGCCGAGATCGTGCCATTGCACTTCAGCCTGGGCGACAGAGCGAGACTCTGTCTCAAAAA 
TAATAATAATAACAATAACTAGCCGGGCCTGGTGGCACATGCCTGTAGTCCCAGTTACTC 

AGGAGGCGGAGGCATGAGACTCAGGT6MCTAGGGAGACAGAGGTTGCAGTGAGCCAA6A 

TCACACCACTGCACTCCAGCCTGGTTGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAA 
[A,-,T]

CCCATTTGCTCAI 1 1 1 1 IGGATACTAGTATAACTATCACTCTAAACCAGTTAGTACTTAA 
ATCAAGCAGATATGGGAGATGGTGAATTACCATCTACAGTGTTGTCATATATGTCACATA 
CTGAGCATTATCAGCTAGTAGAATCTAGTTAATTGTTCTATGTGTGATGTATGCAGAGTT 
CCCATTTTGMTGTGTTTTTACTATGCTTAAATAAATGACTGATGTCAGCAACCCCAAAA 
TGATACATCTGATGTAAGAGCCCCTGTTCCCCAATAATAACATCTAAACTATAGACATTG 

43357 AGGCATGAGACTCAGGTGAACTAGGGAGACAGAGGTTGCAGTGAGCCAAGATCACACCAC 

TGCACTCCAGCCTGGTTGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAATCCCATTTG • 
CTCATTTTITGGATACTAGTATMCTATCACTCTAMCCAGtTAGTACTTAAATCAAGCA ; 
GATATGGGAGATGGTGMTtACCATCTACAGTGTTGTCATATATGTCACATACTGAGCAT 
TATCAGCTAGTAGMTCTAGnMnGTTCTATGTGTGATGTATGCAGAGTTCCCATTTT 
[T,G]

MTGTGTTTnACTATGCTTAAATAAATGACTGATGTCAGCAACCCCAAAATGATACATC 
TGATGTAAGAGCCCCTGTTCCCCAATAATAACATCTAAACTATAGACATTGGAATGAACA 
GGTGCCCCTMGTTTCCTCCCTCCAGGGTTTCTTGGCCGGTCTCTGAGGACTACACATCC 
CTACTCCCGTCTTTCCTCATCTTCAGGCGCAGTAACAGTATCTCCAAGTCCCCTGGCCCC 
AGCTCCCCAAAGGAGCCCCTGCTGTTCAGCCGTGACATCAGCCGCTCAGAATCCCTTCGT 
45664 CCAGCTTTCCTTGGCTTCCCCCACCCCCAGGTGAAAGTGATGCGCAGCCTGGACCACCCC

AATGTGCTCAAGTTCATTGGTGTGCTGTACAAGGATAAGAAGCTGAACCTGCTGACAGAG 
TACATTGAGGGGGGCACACTGAAGGACTTTCTGCGCAGTATGGTGAGCACACCACCCCAT 

AGTCTCCAGGAGCCTTGGTGGGTTGTCAGACACCTATGCTATCACTACCCTAGGAGCTTA 
MGGGCAGAGGGGCCCTGCTTTGCCTCCAAAGGACCATGCTGGGTGGGACTGAGCATACA 

[T,C]

AGGGAGGCTTCACTGGGAGACCACATTGACCCATGGGGCCTGGACCACGAGTGGGACAGG 
GCTCMCAGCCTCTGAAAATCATTCCCCATTCTGCAGGATCCGTTCCCCTGGCAGCAGAA 
GGTCAGGTTTGCCAAAGGAATCGCCTCCGGAATGGTGAGTCCCACCAACAAACCTGCCAG 
CAGGGCGAGAGTAGGGAGAGGTGTGAGAATTGTGGGCTTCACTGGAAGGTAGAGACCCCT 
TCCTATGCMCTTGTGTGGGCTGGGTCAGCAGCTATTCATTGAGTTTGTCTGTGTCACTG 

47549 AATTAGCTGGGCGTGGTGGTGCACGCCTGTAGTCCCAGCTACTCAGGAGGCCGAGGCAGG 

AGAATAGCTTGAACCTGGGAGGCAGAAGTTGCAGTGAGCCAAGATCACACCACTGCATTC 
CAGCCTGGGTGACAGAGTGAGACTTCATCTCAAAAAAAAAAAAAAAGAGAGACTGATATG 
GTTAGTACATrGGGGTGGMTGCGGAGGGTCCAGGGAATGGAGCCCTGCATAGGGGGCTA 
ATGAMCATTTCAGATTTCTGAATTAAGGTAGTGGCTGTGGGGACAGGAGCCTGGGAGGC 

[A,C]

GGGTGGAGTCAGMTGGAGAGACTGGTTGGCAATGAGGGAACAGGAGGAGGAGGAGGAGG 
AGTTACGAGTGGCTTGAGGTGTCACTTACCAGACATTTGGGGGATGGGGGATAGCCGTGA 
TTGTTGAGCAACTGGTTTGGGAAGAGCTAGCATTGATCCCTGCTGTTCTGTGCTAGCAGA 
ACCTATCAGCATCTTCTGGGCAGGAAACTGGCTCCATGAGACTGGCTTAGGGAGAGGCTG 
CTAGTCACCTMtCTGCAGAGAAGGGGCAGCTGGAGCTGTGGGACAGAAGAGGCATCCAT 

-47908 GGAGlTACGAGTGGCnGAGGTGTCACTTACCAGACATnGGGGGATGGGGGATAGCCGT 

GATTGnGAGCMCTGGTTTGGGAAGAGCTAGCATTGATCCCTGCTGTTCTGTGCTAGCA 
GMCCTATCAGCATCnCTGGGWGGAAACTGGCTCCATGAGACTGGCTTAGGGAGAGGC 
TGCTAGTCACCTAATCTGCAGAGAAGGGGCAGCTGGAGCTGTGGGACAGAAGAGGCATCC 
ATGTAGCTGGTGGGGGTGTCTCAGCnGTGMGAGGAGATGGCTTTGAGCAGGGCTGACA 

[C,A]

TGAAAAGGCTGGAAGAAAAAAACAGACACACAAGAGTCTCAGGATCAGGTAGCATAGGAA 
AGTTGTGGACAGTCTTTGAGGAGCACTCCCTCAGGCAGGCAGGCAGGCAGGTCATGAGCT 
ATAGCGATTCAGGAAGAGCTCCCTGGGTGTGTGAGCAGCTCCAGGAGCCTAAGGGATGAA 
AGTAGTATTGCAGGGGGCTGGAGAGCMGGAGTGGCTCCTTCTACATTTGCAAGGGAAGG 
AGAAAGGAAGTTGCTCCTGAGAGTGGTAAGAGTCAGTGGTGGAGGCCTGGAGAGGAGACA 

52267 TTGTGAGGGGTAGAGGAGAGGAGAGACMGGGATGGnAGGATMTGMGGAATGTTTTG 

TTITrGlTTTTGTTTTTGAGATGGAGTnCACTCTGTCACCCAGGCTGGA 
GCMTClTGGCTCACTGCAGCCTCCGCCTCCCAGGTTCAAGCAATC CTCCTG CCTCAGCC 
TCCCAAGT AGCTGGGACTACAGGTGTGCGCCACCACGCCTGGCTMTTTTrGTATTTT 
GTAGAGACAGGGTnCGCCATAnGGCCAGGCTGGTCTCAAATGCCTGACCTCAGGTGAT 

[C,A]

CACCCGCTTCAGCCTCCCAAAGTGCTGAGATTAGAGGCATGAGCTACCGTGCCTGGCCAT 
GMGGMGAT7TGTTTTAAAAMnGTTn"CTTTMTAnMTTGAACACCTCTGTTCAG 
AGCACTGGGCTGGTGCCAGAGGGTTTCAGACATGAATCAGATCCAGCACCTCATAGAGCC 
TTAATCTGGCACACACACACAGCCACAAGGAGACACAGACAAGGCAGGGTAGGATGAGTG 
GMGCTAGGAGCAGATGCTGATnGGMCACnGGCnCTGCAGTGMGCCCCTrCTTAG 
GGCCCCGGCCCCGGCCCCCAGGCCAGGCAGTGGCGGCCAAGGACCACGCATCTACTTTCA
GAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGGCCCAGTGA 
GGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCACCTCAACC 
TAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGAGGAGATCT 
CAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGCCTGGGCTT 

[T,C]

CAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCATCTCTGGCCT 
GCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCC 
GACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGTAGCAACAG 
CAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGAGCAGG GCTCCTCG TGCCCCTG 
GCCCAGGGGTCTCTTCCCCTGCCCCCTCAGTTTTCCACTTTTGGATTTTTTTATTGTTAT 

GGCAGTGGCGGCCAAGGACCACGCATCTACTTTCAGAGCCCCCCCCGGGGCCGCAGGAGA 
GGGCCCGGGCTGGGCGGATGATGAGGGCCCAGTGAGGCGCCAAGGGAAGGTCACCATCAA 
GTATGACCCCAAGGAGCTACGGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCT 
CACGCGCCTCTACGACTGCCAGGAAGAGGAGATCTCAGAACTAGAGATTGACGTGGATGA 
GCTCCTGGACATGGAGAGTGACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCTGGTTGA 

[C,G]

TGTTACAAACCCACAGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCAG 
AAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCAT 
AGGACAATCGCTACCCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAG 
GCCTGGTTCCATGAGCAGGGCTCCTCGTGCCCCTGGCCCAGGGGTCTCTTCCCC TGCCCC 

ctcagttnccacttttggatttttttang 

aggaccacgcatctactttcagagccccccccggggccgcaggagagggcccgggctggg 
cggatgatgagggcccagtgaggcgccaagggaaggtcaccatcaagtatgaccccaagg 
agctacggaagcacctcaacctagaggagtggatcctggagcagctcacgcgcctctacg 
actgccaggaagaggagatctcagaactagagattgacgtggatgagctcctggacatgg 
agagtgacgatgcctgggcttccagggtcaaggagctgctggttgactgttacaaacgca 

[A,C]

AGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACC 
CCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTAC 
CCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGA 
GCAGGGCTCCTCGTGCCCCTGGCCCAGGGGTCTCTTCCCCT GCCCCC TCAGTTTTCCACT 
TnGGATTTTTTTAnGTTAnAMCTGATGGGACTTTGTGTTTTTAT^ 

TACTTTCAGAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGG 
CCCAGTGAGGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCA 
CCTCAACCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGA 
GGAGATCTCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGC 
CTGGGCnCCAGGGTCMGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCAT . 

[T,C]

TCTGGCCTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGA 
GGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGT 
AGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGAGCAGG GCTCCTCG 
TGCCCCTGGCCCAGGGGTCTCnCGCCTGCCCCCTCAGTTTTCCACTTTTGGAl 1 1 1 1 1 1 
AnGnAnAMCTGATGGGACmGT(nTmATATTGACTCT(K;(^AC(mCCTrT 
54712 CAGAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGGCCCAGT
GAGGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCACCTCAA 
CCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGAGGAGAT 
CTCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGCCTGGGC 
TTCCAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCATCTCTGG 
[T,C] 

CTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCC 
CCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGTAGCAAC 
AGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGAGCAGG GCTCCTCG TGCCCC 
TGGCCCAGGGGTCTCTTCCCCTGCCCCCTCAGTTTTCCACTTTTGGAI 1 1 1 1 1 IATTGTT 
AnAMCTGATGGGACTTTGTGTTrTTATATTGACTCTGCGGCACGGGCCCTTTAATAAA 

GTATGACCCCAAGGAGCTACGGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCT 
CACGCGCCTCTACGACTGCCAGGAAGAGGAGATCTCAGAACTAGAGATTGACGTGGATGA 
GCTCCTGGACATGGAGAGTGACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCTGGTTGA 
CTGTTACAAACCCACAGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCA 
GAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCA 
[T,C]

AGGACAATCGCTACCCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAG 
GCCTGGTTCCATGAGCAGGG CTCCT CGTGCCCCTGGCCCAGGGGTCTCTTCCCC TGCCCC 
CTCAGTTTTCCACTTTTGGA 1 1 1 1 1 1 1 ATTGTTATTAMCTGATGGGACTTTGTGTTTTT 
ATATTGACTCTGCGGCACGGGCCCTTTAATAAAGCGAGGTAGGGTACGCCTTTGGTGCAG 
CTCAAAAAAAAAAAAAAAMTGATTTCCAGCGGTCCACATTAGAGTTGAAATTTTCTGGT 

GGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGGTCACGCGCCTCTACGACTGCC 
AGGV\AGAGGAGATCTCAGAACTAGAGATTGACGT6GATGAGCTCCTG6ACATGGAGAGTG 

ACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGG 

CCTTTMTCTOimi^^ 

AGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCC 
[G,A]

ACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATG AGCA GGG 
CTCCTCGTGCCCCTGGCCCAGGGGTCTCnCCCC TGCCCC CTCAGTTTTCCACTTTTGGA 

1 1 1 1 1 1 lA nGTTATTAAACTGATGGGACTnGTGTTTTTATATTGACTCTGCGGCACGG . 
GeCCTnMTAAAGCGAGGTAGGGTACGCCmGGTGCAGCTCAAAAAAAAAAAAAAAAA 
TGATTTCCAGCGGTCCACATTAGAGTTGAMTTTTCTGGTGGGAGAATCTATACCTTGTT 

55499 TTGTTTTCTMTACCTCnGTCAnCTAMTATCmMmATTAAAAAATATATATAT 

ACAGTAnGMTGCCTACTGTGTGCTAGGTACAGTTCTAAACACTTGGGTTACAGCAGCG 
AACAAAATAAAGGTGCTTACCCTCATAGAACATAGATTCTAGCATGGTATCTACTGTATC 

ATACAGTAGATACMTMGTAAACTATATTGMTATTAGAATGTGGCAGATGCTATGGAA. 

AMGAGTCMGACMGTAMGACGAnGnCAGGGTACCAGTTGCMTTITAM 

[C,T]

GTCAGAGCAGGCCTCACTGAGGTGACATGACATTTAAGCATAAACATGGAGGAGGAGGAG 
TAAGCCTGAGCTGTCTTAGGCTTCCGGGGCAGCCAAGCCATTTCCGTGGCACTAGGAGCC 
TGGTGmCCGAnCCACCmGATAACTGCATTTrCTCTAAGATATGGGAGGGAAGTTT 
TTCTCCTAnGTrnTMGTAnAACTCCAGCTAGTCCAGCCTTGTTATAGTGTTACCTA 
ATCTnATAGCAMTATATGAGGTACCGGTMCATTATGCCCATTTCTCACAGAGGCACT 

FIG. 3-35 
56825 ACTGATGGCTCAAAGGGTGTGAAAAAGTCAGTGATGCTCCCCCTTTCTACTCCAGATCCT

GTCCTTCCTGGAGCMGGnGAGGGAGTAGGTTTTGAAGAGTCCCTTAATATGTGGTGGA 

ACAGGCCAGGAGTTAGAGAAAGGGCTGGCTTCTGTTTACCTGCTCACTGGCTCTAGCCAG 

CCCAGGGACCACATCMTGTGAGAGGMGCCTCCACCTCATGTTTTCAAAC7TAATACTG 

GAGACTGGCTGAGAACTTACGGACAACATCCTTTCTGTCTGAAACAAACAGTCACAAGCA 

[C,A]

AGGAAGAGGCTGGGGGACTAGAAAGAGGCCCTGCCCTCTAGAAAGCTCAGATCTTGGCTT 
CTGTTACTCATACTCGGGTGGGCTCCnAGTCAGATGCCTAAMCATnTGCCTAAAGCT 
CGATGGGTTCTGGAGGACAGTGTGGCTTGTCACAGGCCTAGAGTCT6AGGGAGGGGAGTG 
GGAGTCTCAGCAATCTCTTGGTCTTGGCTTCATGGCAACCACTGCTCACCCTTCAACATG 
CCTGGmAGGCAGCAGCnGGGCTGGGMGAGGTGGTGGCAGAGTCTCAMGCTGAGAT 

58871 CGTCACCCACCACCCAACCCCTGCCGCACTCCAGCCTTTAACAAGGGCTGTCTAGATATT 

CAirnMCTACCTCCACCnGGAMCMTTGCTGAAGGGGAGAGGATTTGCAATGACCA 

ACCACCTTGTTGGGACGCCTGCACACCTGTCTTTCCTGCTTCAACCTGAAAGATTCCTGA 

TGATGATAATCTGGACACAGAAGCCGGGCACGGTGGCTCTAGCCTGTAATCTCAGCACTT 

TGGGAGGCCTCAGCAGGTGGATCACCTGAGATCAAGAGTTTGAGAACAGCCTGACCAACA 
[T,A]

GGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGTGTGGTGGCACATACCTG 

TAATCCCAGCTACTCTGGAGGCTGAGGCAGGAGAATCGCTTGAACCCACAAGGCAGAGGT 

TGCAGTGAGGCGAGATCATGCCAITGCACTCCAGCCTGTGCAACAAGAGCCAAACTCCAT 
CTCAAAAAAAAAAA 

FIG. 3-36 
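
The variant entries in FIG. 3 appear to follow a fixed layout: a DNA position, upstream flanking sequence, a bracketed allele line such as [T,C], and downstream flanking sequence. The following is a minimal Python sketch for reading the allele line. It assumes comma-separated alleles with "-" marking an absent base, which is an interpretation of the figure rather than something the specification states, and the names `parse_alleles` and `describe_variant` are hypothetical.

```python
import re

# Assumed layout of an allele line: alleles separated by commas inside
# square brackets, with "-" standing for an absent base, e.g. "[T,C]" or "[-,A]".
ALLELE_LINE = re.compile(r"^\[([ACGT-](?:,[ACGT-])+)\]$")


def parse_alleles(line):
    """Return the list of alleles on a bracketed allele line."""
    match = ALLELE_LINE.match(line.strip().upper())
    if match is None:
        raise ValueError(f"not an allele line: {line!r}")
    return match.group(1).split(",")


def describe_variant(alleles):
    """Label a variant as an insertion/deletion or a substitution."""
    return "insertion/deletion" if "-" in alleles else "substitution"


if __name__ == "__main__":
    for text in ("[T,C]", "[-,A]", "[A,-,T]"):
        alleles = parse_alleles(text)
        print(text, alleles, describe_variant(alleles))
```

Rejecting malformed lines with an exception, rather than guessing at them, keeps scanning errors from being silently read as alleles.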
ISOLATED HUMAN KINASE PROTEINS,
NUCLEIC ACID MOLECULES ENCODING
HUMAN KINASE PROTEINS, AND USES
THEREOF

FIELD OF THE INVENTION

The present invention is in the field of kinase proteins that
are related to the serine/threonine kinase subfamily, recombinant
DNA molecules, and protein production. The present
invention specifically provides novel peptides and proteins
that effect protein phosphorylation and nucleic acid molecules
encoding such peptide and protein molecules, all of
which are useful in the development of human therapeutics
and diagnostic compositions and methods.

BACKGROUND OF THE INVENTION 

Protein Kinases 

Kinases regulate many different cell proliferation, 
differentiation, and signaling processes by adding phosphate 
groups to proteins. Uncontrolled signaling has been impli- 
cated in a variety of disease conditions including 
inflammation, cancer, arteriosclerosis, and psoriasis. 
Reversible protein phosphorylation is the main strategy for 
controlling activities of eukaryotic cells. It is estimated that 
more than 1000 of the 10,000 proteins active in a typical 
mammalian cell are phosphorylated. The high energy 
phosphate, which drives activation, is generally transferred 
from adenosine triphosphate molecules (ATP) to a particular 
protein by protein kinases and removed from that protein by 
protein phosphatases. Phosphorylation occurs in response to
extracellular signals (hormones, neurotransmitters, growth
and differentiation factors, etc.), cell cycle checkpoints, and
environmental or nutritional stresses and is roughly analogous
to turning on a molecular switch. When the switch goes
on, the appropriate protein kinase activates a metabolic 
enzyme, regulatory protein, receptor, cytoskeletal protein, 
ion channel or pump, or transcription factor. 

The kinases comprise the largest known protein group, a
superfamily of enzymes with widely varied functions and
specificities. They are usually named after their substrate,
their regulatory molecules, or some aspect of a mutant
phenotype. With regard to substrates, the protein kinases
may be roughly divided into two groups: those that phosphorylate
tyrosine residues (protein tyrosine kinases, PTK)
and those that phosphorylate serine or threonine residues
(serine/threonine kinases, STK). A few protein kinases have
dual specificity and phosphorylate threonine and tyrosine
residues. Almost all kinases contain a similar 250-300
amino acid catalytic domain. The N-terminal domain, which
contains subdomains I-IV, generally folds into a two-lobed
structure, which binds and orients the ATP (or GTP) donor
molecule. The larger C-terminal lobe, which contains subdomains
VIA-XI, binds the protein substrate and carries out
the transfer of the gamma phosphate from ATP to the
hydroxyl group of a serine, threonine, or tyrosine residue.
Subdomain V spans the two lobes.
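
The substrate-based grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_kinase` and the use of one-letter residue codes (S, T, Y) are introduced here and do not appear in the specification.

```python
# One-letter amino acid codes for the hydroxyl-bearing acceptor residues
# named in the text: serine (S), threonine (T), and tyrosine (Y).
SER_THR = {"S", "T"}
TYR = {"Y"}


def classify_kinase(acceptor_residues):
    """Group a kinase by the residues it phosphorylates.

    Returns "PTK" (tyrosine only), "STK" (serine and/or threonine only),
    or "dual specificity" (tyrosine together with serine or threonine).
    """
    residues = {r.upper() for r in acceptor_residues}
    if not residues or residues - (SER_THR | TYR):
        raise ValueError(f"expected only S, T or Y, got {sorted(residues)}")
    has_tyr = bool(residues & TYR)
    has_ser_thr = bool(residues & SER_THR)
    if has_tyr and has_ser_thr:
        return "dual specificity"
    return "PTK" if has_tyr else "STK"


if __name__ == "__main__":
    print(classify_kinase(["S", "T"]))  # serine/threonine kinase
    print(classify_kinase(["Y"]))       # protein tyrosine kinase
    print(classify_kinase(["T", "Y"]))  # dual specificity
```

The sketch captures only the substrate-residue rule; it says nothing about the subdomain structure or sequence features that define individual kinase families.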

The kinases may be categorized into families by the
different amino acid sequences (generally between 5 and
100 residues) located on either side of, or inserted into loops
of, the kinase domain. These added amino acid sequences
allow the regulation of each kinase as it recognizes and
interacts with its target protein. The primary structure of the
kinase domains is conserved and can be further subdivided
into 11 subdomains. Each of the 11 subdomains contains
specific residues and motifs or patterns of amino acids that
are characteristic of that subdomain and are highly conserved
(Hardie, G. and Hanks, S. (1995) The Protein Kinase
Facts Books, Vol. 1:7-20, Academic Press, San Diego, Calif.).

The second messenger dependent protein kinases prima- 
rily mediate the effects of second messengers such as cyclic 
AMP (cAMP), cyclic GMP, inositol triphosphate, 
phosphatidylinositol, 3,4,5-triphospbate, cyclic-ADPribose, 
arachidonic acid, diacylglycerol and calcium-calmodulin. 
The cyclic-AMP dependent protein kinases (PKA) arc 
important members of the STK family. Cyclic-AMP is an 
intracellular mediator of hormone action in all prokaryotic 
and animal cells that have been studied. Such hormone- 
induced cellular responses include thyroid hormone 
secretion, Cortisol secretion, progesterone secretion, glyco- 
gen breakdown, bone resorption, and regulation of heart rate 
and force of heart muscle contraction. PKA is found in all 
animal cells and is thought to account for the effects of 
cyclic-AMP in most of these cells. Altered PKA expression 
is implicated in a variety of disorders and diseases including 
cancer, thyroid disorders, diabetes, atherosclerosis, and car- 
diovascular disease (Isselbacher, K. J. et al. (1994) Harri- 
son's Principles of Internal Medicine, McGraw-Hill, New 
York, N.Y., pp. 416-431, 1887).

Calcium-calmodulin (CaM) dependent protein kinases are
also members of the STK family. Calmodulin is a calcium
receptor that mediates many calcium regulated processes by
binding to target proteins in response to the binding of
calcium. The principal target proteins in these processes are
CaM dependent protein kinases. CaM-kinases are involved
in regulation of smooth muscle contraction (MLC kinase),
glycogen breakdown (phosphorylase kinase), and neu-
rotransmission (CaM kinase I and CaM kinase II). CaM
kinase I phosphorylates a variety of substrates including the
neurotransmitter related proteins synapsin I and II, the gene
transcription regulator, CREB, and the cystic fibrosis con-
ductance regulator protein, CFTR (Haribabu, B. et al. (1995)
EMBO Journal 14:3679-86). CaM II kinase also phospho- 
rylates synapsin at different sites, and controls the synthesis 
of catecholamines in the brain through phosphorylation and 
activation of tyrosine hydroxylase. Many of the CaM 
kinases are activated by phosphorylation in addition to 
binding to CaM. The kinase may autophosphorylate itself, or 
be phosphorylated by another kinase as part of a "kinase 
cascade". 

Another ligand-activated protein kinase is 5'-AMP-
activated protein kinase (AMPK) (Gao, G. et al. (1996) J.
Biol Chem. 15:8675-81). Mammalian AMPK is a regulator
of fatty acid and sterol synthesis through phosphorylation of
the enzymes acetyl-CoA carboxylase and
hydroxymethylglutaryl-CoA reductase and mediates
responses of these pathways to cellular stresses such as heat
shock and depletion of glucose and ATP. AMPK is a
heterotrimeric complex comprised of a catalytic alpha sub-
unit and two non-catalytic beta and gamma subunits that are
believed to regulate the activity of the alpha subunit. Sub- 
units of AMPK have a much wider distribution in non- 
lipogenic tissues such as brain, heart, spleen, and lung than 
expected. This distribution suggests that its role may extend 
beyond regulation of lipid metabolism alone. 

The mitogen-activated protein kinases (MAP) are also 
members of the STK family. MAP kinases also regulate
intracellular signaling pathways. They mediate signal trans- 
duction from the cell surface to the nucleus via phosphory- 
lation cascades. Several subgroups have been identified, and 
each manifests different substrate specificities and responds 
to distinct extracellular stimuli (Egan, S. E. and Weinberg, 
R. A. (1993) Nature 365:781-783). MAP kinase signaling
pathways are present in mammalian cells as well as in yeast.
The extracellular stimuli that activate mammalian pathways 
include epidermal growth factor (EGF), ultraviolet light, 
hyperosmolar medium, heat shock, endotoxic lipopolysac- 
charide (LPS), and pro-inflammatory cytokines such as 
tumor necrosis factor (TNF) and interleukin-1 (IL-1).

PRK (proliferation-related kinase) is a serum/cytokine
inducible STK that is involved in regulation of the cell cycle
and cell proliferation in human megakaryocytic cells (Li, B. et
al. (1996) J. Biol Chem. 271:19402-5). PRK is related to
the polo (derived from humans polo gene) family of STKs
implicated in cell division. PRK is downregulated in lung
tumor tissue and may be a proto-oncogene whose deregulated
expression in normal tissue leads to oncogenic transformation.
Altered MAP kinase expression is implicated in
a variety of disease conditions including cancer,
inflammation, immune disorders, and disorders affecting
growth and development.

The cyclin-dependent protein kinases (CDKs) are another
group of STKs that control the progression of cells through
the cell cycle. Cyclins are small regulatory proteins that act
by binding to and activating CDKs that then trigger various
phases of the cell cycle by phosphorylating and activating
selected proteins involved in the mitotic process. CDKs are
unique in that they require multiple inputs to become
activated. In addition to the binding of cyclin, CDK activa-
tion requires the phosphorylation of a specific threonine
residue and the dephosphorylation of a specific tyrosine
residue.

Protein tyrosine kinases, PTKs, specifically phosphorylate
tyrosine residues on their target proteins and may be
divided into transmembrane, receptor PTKs and
nontransmembrane, non-receptor PTKs. Transmembrane
protein-tyrosine kinases are receptors for most growth fac-
tors. Binding of growth factor to the receptor activates the
transfer of a phosphate group from ATP to selected tyrosine
side chains of the receptor and other specific proteins.
Growth factors (GF) associated with receptor PTKs include
epidermal GF, platelet-derived GF, fibroblast GF, hepatocyte
GF, insulin and insulin-like GFs, nerve GF, vascular endot-
helial GF, and macrophage colony stimulating factor.

Non-receptor PTKs lack transmembrane regions and,
instead, form complexes with the intracellular regions of cell
surface receptors. Such receptors that function through
non-receptor PTKs include those for cytokines, hormones
(growth hormone and prolactin) and antigen-specific
receptors on T and B lymphocytes.

Many of these PTKs were first identified as the products
of mutant oncogenes in cancer cells where their activation
was no longer subject to normal cellular controls. In fact,
about one third of the known oncogenes encode PTKs, and
it is well known that cellular transformation (oncogenesis) is
often accompanied by increased tyrosine phosphorylation
activity (Carbonneau H and Tonks NK (1992) Annu. Rev.
Cell Biol 8:463-93). Regulation of PTK activity may
therefore be an important strategy in controlling some types
of cancer.

LIM Domain Kinases

The novel human protein, and encoding gene, provided by
the present invention is related to the family of serine/
threonine kinases in general, particularly LIM domain
kinases (LIMK), and shows the highest degree of similarity
to LIMK2, and the LIMK2b isoform (Genbank gi8051618)
in particular (see the amino acid sequence alignment of the
protein of the present invention against LIMK2b provided in
FIG. 2). LIMK proteins generally have serine/threonine
kinase activity. The protein of the present invention may be
a novel alternative splice form of the art-known protein
provided in Genbank gi8051618; however, the structure of
the gene provided by the present invention is different from
the art-known gene of gi8051618 and the first exon of the
gene of the present invention is novel, suggesting a novel
gene rather than an alternative splice form. Furthermore, the
protein of the present invention lacks an LIM domain
relative to gi8051618. The protein of the present invention
does contain the kinase catalytic domain.

Approximately 40 LIM proteins, named for the LIM
domains they contain, are known to exist in eukaryotes. LIM
domains are conserved, cysteine-rich structures that contain 2
zinc fingers that are thought to modulate protein-protein
interactions. LIMK1 and LIMK2 are members of a LIM
subfamily characterized by 2 N-terminal LIM domains and
a C-terminal protein kinase domain. LIMK1 and LIMK2
mRNA expression varies greatly between different tissues.
The protein kinase domains of LIMK1 and LIMK2 contain
a unique sequence motif comprising Asp-Leu-Asn-Ser-His-Asn
in subdomain VIB and a strongly basic insert between
subdomains VII and VIII (Okano et al., J. Biol Chem. 270
(52), 31321-31330 (1995)). The protein kinase domain
present in LIMKs is significantly different than other kinase
domains, sharing about 32% identity.

LIMK is activated by ROCK (a downstream effector of
Rho) via phosphorylation. LIMK then phosphorylates
cofilin, which inhibits its actin-depolymerizing activity,
thereby leading to Rho-induced reorganization of the actin
cytoskeleton (Maekawa et al., Science 285:895-898, 1999).

The LIMK2a and LIMK2b alternative transcript forms are
differentially expressed in a tissue-specific manner and are
generated by variation in transcriptional initiation utilizing
alternative promoters. LIMK2a contains 2 LIM domains, a
PDZ domain (a domain that functions in protein-protein
interactions targeting the protein to the submembranous
compartment), and a kinase domain; whereas LIMK2b just
has 1.5 LIM domains. Alteration of LIMK2a and LIMK2b
regulation has been observed in some cancer cell lines
(Osada et al., Biochem. Biophys. Res. Commun. 229:
582-589, 1996).

For a further review of LIMK proteins, see Nomoto et al.,
Gene 236 (2), 259-271 (1999).

Kinase proteins, particularly members of the serine/
threonine kinase subfamily, are a major target for drug
action and development. Accordingly, it is valuable to the
field of pharmaceutical development to identify and
characterize previously unknown members of this subfamily of
kinase proteins. The present invention advances the state of
the art by providing previously unidentified human kinase
proteins that have homology to members of the serine/
threonine kinase subfamily.

SUMMARY OF THE INVENTION

The present invention is based in part on the identification
of amino acid sequences of human kinase peptides and
proteins that are related to the serine/threonine kinase
subfamily, as well as allelic variants and other mammalian
orthologs thereof. These unique peptide sequences, and
nucleic acid sequences that encode these peptides, can be
used as models for the development of human therapeutic
targets, aid in the identification of therapeutic proteins, and
serve as targets for the development of human therapeutic
agents that modulate kinase activity in cells and tissues that
express the kinase. Experimental data as provided in FIG. 1
indicates expression in humans in teratocarcinoma, ovary,
testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland.

DESCRIPTION OF THE FIGURE SHEETS 

FIG. 1 provides the nucleotide sequence of a cDNA
molecule that encodes the kinase protein of the present 
invention. (SEQ ID NO:1) In addition, structure and func-
tional information is provided, such as ATG start, stop and
tissue distribution, where available, that allows one to 
readily determine specific uses of inventions based on this 
molecular sequence. Experimental data as provided in FIG. 
1 indicates expression in humans in teratocarcinoma, ovary, 
testis, nervous tissue, bladder, infant and fetal brain, and 
thyroid gland. 

FIG. 2 provides the predicted amino acid sequence of the 
kinase of the present invention. (SEQ ID NO:2) In addition 
structure and functional information such as protein family, 
function, and modification sites is provided where available, 
allowing one to readily determine specific uses of inventions 
based on this molecular sequence. 

FIG. 3 provides genomic sequences that span the gene 
encoding the kinase protein of the present invention. (SEQ 
ID NO:3) In addition structure and functional information, 
such as intron/exon structure, promoter location, etc., is 
provided where available, allowing one to readily determine 
specific uses of inventions based on this molecular 
sequence. As illustrated in FIG. 3, SNPs were identified at
42 different nucleotide positions.

DETAILED DESCRIPTION OF THE INVENTION

General Description

The present invention is based on the sequencing of the
human genome. During the sequencing and assembly of the
human genome, analysis of the sequence information
revealed previously unidentified fragments of the human
genome that encode peptides that share structural and/or
sequence homology to protein/peptide/domains identified
and characterized within the art as being a kinase protein or
part of a kinase protein and are related to the serine/
threonine kinase subfamily. Utilizing these sequences, addi-
tional genomic sequences were assembled and transcript
and/or cDNA sequences were isolated and characterized.
Based on this analysis, the present invention provides amino
acid sequences of human kinase peptides and proteins that
are related to the serine/threonine kinase subfamily, nucleic
acid sequences in the form of transcript sequences, cDNA
sequences and/or genomic sequences that encode these
kinase peptides and proteins, nucleic acid variation (allelic
information), tissue distribution of expression, and informa-
tion about the closest art known protein/peptide/domain that
has structural or sequence homology to the kinase of the
present invention.

In addition to being previously unknown, the peptides that
are provided in the present invention are selected based on
their ability to be used for the development of commercially
important products and services. Specifically, the present
peptides are selected based on homology and/or structural
relatedness to known kinase proteins of the serine/threonine
kinase subfamily and the expression pattern observed.
Experimental data as provided in FIG. 1 indicates expres-
sion in humans in teratocarcinoma, ovary, testis, nervous
tissue, bladder, infant and fetal brain, and thyroid gland. The
art has clearly established the commercial importance of
members of this family of proteins and proteins that have
expression patterns similar to that of the present gene. Some
of the more specific features of the peptides of the present
invention, and the uses thereof, are described herein, par-
ticularly in the Background of the Invention and in the
annotation provided in the Figures, and/or are known within
the art for each of the known serine/threonine kinase family
or subfamily of kinase proteins.

Specific Embodiments

Peptide Molecules

The present invention provides nucleic acid sequences
that encode protein molecules that have been identified as 
being members of the kinase family of proteins and are 
related to the serine/threonine kinase subfamily (protein 
sequences are provided in FIG. 2, transcript/cDNA 
sequences are provided in FIG. 1 and genomic sequences are 
provided in FIG. 3). The peptide sequences provided in FIG. 
2, as well as the obvious variants described herein, particu- 
larly allelic variants as identified herein and using the 
information in FIG. 3, will be referred to herein as the kinase
peptides of the present invention, kinase peptides, or
peptides/proteins of the present invention.

The present invention provides isolated peptide and pro-
tein molecules that consist of, consist essentially of, or
comprise the amino acid sequences of the kinase peptides 
disclosed in the FIG. 2, (encoded by the nucleic acid 
molecule shown in FIG. 1, transcript/cDNA or FIG. 3, 
genomic sequence), as well as all obvious variants of these 
peptides that are within the art to make and use. Some of 
these variants are described in detail below. 

As used herein, a peptide is said to be "isolated" or 
"purified" when it is substantially free of cellular material or 
free of chemical precursors or other chemicals. The peptides 
of the present invention can be purified to homogeneity or
other degrees of purity. The level of purification will be
based on the intended use. The critical feature is that the
preparation allows for the desired function of the peptide,
even if in the presence of considerable amounts of other
components (the features of an isolated nucleic acid mol-
ecule are discussed below).

In some uses, "substantially free of cellular material" 
includes preparations of the peptide having less than about 
30% (by dry weight) other proteins (i.e., contaminating 
protein), less than about 20% other proteins, less than about 
10% other proteins, or less than about 5% other proteins. 
When the peptide is recombinantly produced, it can also be 
substantially free of culture medium, i.e., culture medium 
represents less than about 20% of the volume of the protein 
preparation.

The language "substantially free of chemical precursors 
or other chemicals" includes preparations of the peptide in 
which it is separated from chemical precursors or other 
chemicals that are involved in its synthesis. In one 
embodiment, the language "substantially free of chemical 
precursors or other chemicals" includes preparations of the 
kinase peptide having less than about 30% (by dry weight) 
chemical precursors or other chemicals, less than about 20% 
chemical precursors or other chemicals, less than about 10% 
chemical precursors or other chemicals, or less than about 
5% chemical precursors or other chemicals. 

The isolated kinase peptide can be purified from cells that 
naturally express it, purified from cells that have been 
altered to express it (recombinant), or synthesized using 
known protein synthesis methods. Experimental data as
provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
and fetal brain, and thyroid gland. For example, a nucleic
acid molecule encoding the kinase peptide is cloned into an
expression vector, the expression vector introduced into a
host cell and the protein expressed in the host cell. The
protein can then be isolated from the cells by an appropriate
purification scheme using standard protein purification tech-
niques. Many of these techniques are described in detail
below.

Accordingly, the present invention provides proteins that 
consist of the amino acid sequences provided in FIG. 2 (SEQ 
ID NO:2), for example, proteins encoded by the transcript/ 
cDNA nucleic acid sequences shown in FIG. 1 (SEQ ID 
NO:1) and the genomic sequences provided in FIG. 3 (SEQ
ID NO: 3). The amino acid sequence of such a protein is 
provided in FIG. 2. A protein consists of an amino acid 
sequence when the amino acid sequence is the final amino 
acid sequence of the protein. 

The present invention further provides proteins that con-
sist essentially of the amino acid sequences provided in FIG.
2 (SEQ ID NO:2), for example, proteins encoded by the
transcript/cDNA nucleic acid sequences shown in FIG. 1
(SEQ ID NO:1) and the genomic sequences provided in FIG.
3 (SEQ ID NO:3). A protein consists essentially of an amino
acid sequence when such an amino acid sequence is present 
with only a few additional amino acid residues, for example 
from about 1 to about 100 or so additional residues, typically 
from 1 to about 20 additional residues in the final protein. 

The present invention further provides proteins that com- 
prise the amino acid sequences provided in FIG. 2 (SEQ ID 
NO:2), for example, proteins encoded by the transcript/ 
cDNA nucleic acid sequences shown in FIG. 1 (SEQ ID 
NO:1) and the genomic sequences provided in FIG. 3 (SEQ
ID NO:3). A protein comprises an amino acid sequence 
when the amino acid sequence is at least part of the final 
amino acid sequence of the protein. In such a fashion, the 
protein can be only the peptide or have additional amino acid 
molecules, such as amino acid residues (contiguous encoded 
sequence) that are naturally associated with it or heterolo- 
gous amino acid residues/peptide sequences. Such a protein 
can have a few additional amino acid residues or can 
comprise several hundred or more additional amino acids. 
The preferred classes of proteins that are comprised of the 
kinase peptides of the present invention are the naturally 
occurring mature proteins. A brief description of how vari- 
ous types of these proteins can be made/isolated is provided 
below. 
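
By way of non-limiting illustration, the three relationships defined above ("consists of", "consists essentially of", and "comprises") can be sketched as a simple string comparison. The function name, the placeholder sequences, and the cutoff of 100 additional residues (taken from the "about 1 to about 100 or so" language above) are assumptions made for this sketch only; the sequences shown are hypothetical and are not SEQ ID NO:2.

```python
def classify_scope(protein: str, reference: str, max_extra: int = 100) -> str:
    """Return the narrowest of the three relationships that applies.

    protein   -- final amino acid sequence of the protein (one-letter codes)
    reference -- the disclosed peptide sequence (hypothetical placeholder here)
    max_extra -- assumed upper bound on "a few additional residues"
    """
    if protein == reference:
        # The reference is the entire final sequence.
        return "consists of"
    if reference in protein:
        # The reference is present as a contiguous part of the final sequence.
        extra = len(protein) - len(reference)
        if extra <= max_extra:
            return "consists essentially of"
        return "comprises"
    return "does not comprise"


if __name__ == "__main__":
    ref = "MKVLT"  # hypothetical placeholder, not a disclosed sequence
    for candidate in (ref, "GS" + ref + "HH", "A" * 150 + ref, "MKV"):
        print(len(candidate), classify_scope(candidate, ref))
```

Note that a protein that "consists of" or "consists essentially of" the sequence also "comprises" it; the sketch reports only the narrowest category that applies.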

The kinase peptides of the present invention can be 
attached to heterologous sequences to form chimeric or 
fusion proteins. Such chimeric and fusion proteins comprise 
a kinase peptide operatively linked to a heterologous protein 
having an amino acid sequence not substantially homolo- 
gous to the kinase peptide. "Operatively linked" indicates 
that the kinase peptide and the heterologous protein are 
fused in-frame. The heterologous protein can be fused to the 
N-terminus or C-terminus of the kinase peptide. 

In some uses, the fusion protein does not affect the
activity of the kinase peptide per se. For example, the fusion
protein can include, but is not limited to, enzymatic fusion
proteins, for example beta-galactosidase fusions, yeast two-
hybrid GAL fusions, poly-His fusions, MYC-tagged,
HI-tagged and Ig fusions. Such fusion proteins, particularly
poly-His fusions, can facilitate the purification of recombi-
nant kinase peptide. In certain host cells (e.g., mammalian
host cells), expression and/or secretion of a protein can be
increased by using a heterologous signal sequence.

A chimeric or fusion protein can be produced by standard
recombinant DNA techniques. For example, DNA fragments 
coding for the different protein sequences are ligated 
together in-frame in accordance with conventional tech- 
niques. In another embodiment, the fusion gene can be 
synthesized by conventional techniques including auto- 
mated DNA synthesizers. Alternatively, PCR amplification 
of gene fragments can be carried out using anchor primers
which give rise to complementary overhangs between two 
consecutive gene fragments which can subsequently be 
annealed and re-amplified to generate a chimeric gene 
sequence (see Ausubel et al., Current Protocols in Molecu- 
lar Biology, 1992). Moreover, many expression vectors are 
commercially available that already encode a fusion moiety 
(e.g., a GST protein). A kinase peptide-encoding nucleic 
acid can be cloned into such an expression vector such that 
the fusion moiety is linked in-frame to the kinase peptide. 
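
As a minimal sketch of what "in-frame" means in practice, the following joins an upstream coding sequence (for example, a fusion moiety) to a downstream coding sequence and checks that the reading frame is preserved. The function name and the short example sequences are hypothetical, and the checks shown (whole codons, removal of the upstream stop codon, no premature stop codon) are assumptions about what a basic in silico design check might include rather than a description of any particular cloning protocol.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def fuse_in_frame(upstream_cds: str, downstream_cds: str) -> str:
    """Join two coding sequences so the downstream one stays in frame."""
    upstream = upstream_cds.upper()
    downstream = downstream_cds.upper()

    # Each coding sequence must be a whole number of codons.
    if len(upstream) % 3 != 0 or len(downstream) % 3 != 0:
        raise ValueError("coding sequences must be multiples of 3 nucleotides")

    # Drop the upstream stop codon so translation reads through the junction.
    if upstream[-3:] in STOP_CODONS:
        upstream = upstream[:-3]

    fused = upstream + downstream
    codons = [fused[i:i + 3] for i in range(0, len(fused), 3)]

    # A stop codon anywhere before the last codon would truncate the fusion.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        raise ValueError("premature stop codon in fused sequence")
    return fused


if __name__ == "__main__":
    tag = "ATGCATCACTAA"      # hypothetical tag: Met-His-His-stop
    insert = "ATGAAAGTGTGA"   # hypothetical insert: Met-Lys-Val-stop
    print(fuse_in_frame(tag, insert))
```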

As mentioned above, the present invention also provides 
and enables obvious variants of the amino acid sequence of 
the proteins of the present invention, such as naturally 
occurring mature forms of the peptide, allelic/sequence 
variants of the peptides, non-naturally occurring recombi-
nantly derived variants of the peptides, and orthologs and
paralogs of the peptides. Such variants can readily be 
generated using art-known techniques in the fields of recom- 
binant nucleic acid technology and protein biochemistry. It 
is understood, however, that variants exclude any amino acid 
sequences disclosed prior to the invention. 

Such variants can readily be identified/made using 
molecular techniques and the sequence information dis-
closed herein. Further, such variants can readily be distin-
guished from other peptides based on sequence and/or
structural homology to the kinase peptides of the present 
invention. The degree of homology/identity present will be 
based primarily on whether the peptide is a functional 
variant or non-functional variant, the amount of divergence 
present in the paralog family and the evolutionary distance 
between the orthologs. 

To determine the percent identity of two amino acid 
sequences or two nucleic acid sequences, the sequences are 
aligned for optimal comparison purposes (e.g., gaps can be 
introduced in one or both of a first and a second amino acid 
or nucleic acid sequence for optimal alignment and non- 
homologous sequences can be disregarded for comparison 
purposes). In a preferred embodiment, at least 30%, 40%, 
50%, 60%, 70%, 80%, or 90% or more of the length of a 
reference sequence is aligned for comparison purposes. The 
amino acid residues or nucleotides at corresponding amino 
acid positions or nucleotide positions are then compared. 
When a position in the first sequence is occupied by the 
same amino acid residue or nucleotide as the corresponding
position in the second sequence, then the molecules are 
identical at that position (as used herein amino acid or 
nucleic acid "identity" is equivalent to amino acid or nucleic 
acid "homology"). The percent identity between the two
sequences is a function of the number of identical positions 
shared by the sequences, taking into account the number of 
gaps, and the length of each gap, which need to be intro- 
duced for optimal alignment of the two sequences. 
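
For illustration only, the position-by-position comparison described above can be sketched as follows for two sequences that have already been aligned. The function name, the use of "-" as the gap character, and the choice of the full alignment length (gap columns included) as the denominator are assumptions of this sketch; other conventions for the denominator exist and give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over all columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps inserted).
    A column counts as identical only when both sequences carry the
    same residue; a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Gap columns lower the result because they enlarge the denominator.
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical aligned fragments; one gap was introduced in the first.
    print(round(percent_identity("MKV-LT", "MKVALS"), 2))
```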

The comparison of sequences and determination of per- 
cent identity and similarity between two sequences can be 
accomplished using a mathematical algorithm. 
{Computational Molecular Biology, Lesk, A. M., cd., 
Oxford University Press, New York, 1988; Biocomputing: 
Informatics and Genome Projects, Smith, D. W., ed., Aca- 
demic Press, New York, 1993; Computer Analysis of 
Sequence Data, Part 1, Griffin, A. M., and Griffin, H. G., 
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eds., Humaoa Press, New Jersey, 1994; Sequence Analysis the proteins) have significant homology when the amino 
in Molecular Biology t von Heinje, G., Academic Press, acid sequences are typically at least about 7(MJ0%, 80-90%, 

1987; and Sequence Analysis Primer, Gribskov, M. and and more typically at least about 90-95% or more homolo- 

Devereux, J., eds., M Stockton Press, New York, 1991). In g 0 us. A significantly homologous amino acid sequence, 

a preferred embodiment, the percent identity between two 5 according to the present invention, will be encoded by a 

. amino acid sequences is determined using the Ncedlcman nucleic acid sequence that wfll hybridize to a kinase peptide 

:and Wunsch (/. Mol Biol (4S);444-453 (1970)) algorithm . CDCO ding nucleic acid molecule under stringent conditions 

which has been incorporated into the GAP program in the ^ more described below 

GCG software package (available at FIG. 3 provides information on SNft that have been 

using either a Blossom 62 matrix or a PAM250 matrix, and 10 found m ^ ^ { m f ^ 

a gap weight of 16 14 12, 10, 8 6, or 4 and a length weight ^ ^ a * ^ at 4 / different 

of 1, 2, 3, 4, 5, or 6. In yet another preferred embodiment, Some of ^ are {h& 

the percent identity between two nucleoade sequences is * Rf ^ m . • ^ transcription. 

. determmed using the GAP program m the GCG software n . ' J . e 

package (Devereux, J., et &L, Nucleic Acids Res. 12(1):387 15 Paralogs of a kmase pcpUde can readily be identified as 

(1984)) (available at http://www.gcg.com), using a NWS- de S rcc of . significant sequence homology/ 

gapdna.CMP matrix and a gap weight of 40, 50, 60, 70, or ldcntllv to at Ieasl a P orllon of the fanasc P c P udc » 35 bcin S 

80 and a length weight of 1, 2, 3, 4, 5, or 6. In another cncoded bv a gene from humans, and as having similar 

embodiment, the percent identity between two amino acid or activitv or faction. Two proteins will typically be consid- 

nucieotide sequences is determined using the algorithm of E. 20 cred P ara [°g s thc amiD0 acid sequences are typically 

Myers and W. Miller (CABIOS, 4:11-17 (1989)) which has al ^ast about 60% or greater, and more typically at least 

been incorporated into the ALIGN program (version 2.0), about 70% or greater homology through a given region or 

using a PAM120 weight residue table, a gap length penalty domain - Such P"*W? ^ be ***** b r a . nuclcic acid 
of 12 and a gap penalty of 4. sequence that will hybridize to a kinase peptide encoding 

tu t *j j «r nucleic acid molecule under moderate to stringent condi- 

The nucleic acid and protein sequences 01 the present 95 4 . - „ , j . , & 

*• . „„j „ ? tr, tions as more fully described below, 

invention can further be used as a "query sequence to J 

perform a search against sequence databases to, for example, Orthologs of a kinase peptide can readily be identified as 
identify other family members or related sequences. Such having some degree of significant sequence homology/ 
searches can be performed using the NBLAST and identity to at least a portion of the kinase peptide as well as 
XBLAST programs (version 2.0) of Altschul, et al (/. Mol 30 bcin 8 encoded by a gene from another organism. Preferred 
Biol. 215:403-10 (1990)). BLAST nucleotide searches can orthologs will be isolated from mammals, preferably 
be performed with the NBLAST program, score-100, primates, for the.development of human therapeutic targets 
wordIength-12 to obtain nucleotide sequences homologous : and agents. Such orthologs will be encoded by a nucleic acid 
to the nucleic acid molecules of the invention. BLAST sequence that will hybridize to a kinase peptide encoding 
protein searches can be performed with the XBLAST 35 nucleic acid molecule under moderate to stringent 
program, score-50, wordlengtb-3 to obtain amino acid conditions, as more fully described below, depending on the 
sequences homologous to the proteins of the invention. To d egree of relatedness of the two organisms yielding the 
obtain gapped alignments for comparison purposes, Gapped proteins. 

BLAST can be utilized as described in Altschul et al. Non-natuxaUy.occurring variants of the kinase peptides of 
{Nucleic Acids Res. 25(17):3389-3402 (1997)). When uti- 40 the present invention can readily be generated using recom- 
Lizing BLAST and gapped BLAST programs, the default binant techniques. Such variants include, but are not limited 
parameters of the respective programs (e.g., XBLAST and to deletions, additions and substitutions in the amino acid 
NBLAST) can be used. sequence of the kinase peptide. For example, one class of 

Full-length pre-processed forms, as well as mature pro- substitutions are conserved amino acid substitution. Such 
cessed forms, of proteins that comprise one of the peptides 45 substitutions are those that substitute a given amino add in 
of the present invention can readily be identified as having complete sequence identity to one of
the kinase peptides of the present invention as well as being encoded by the same genetic locus as
the kinase peptide provided herein. The gene encoding the novel kinase protein of the present
invention is located on a genome component that has been mapped to human chromosome 22 (as
indicated in FIG. 3), which is supported by multiple lines of evidence, such as STS and BAC map
data.

Allelic variants of a kinase peptide can readily be identified as being a human protein having a
high degree (significant) of sequence homology/identity to at least a portion of the kinase peptide
as well as being encoded by the same genetic locus as the kinase peptide provided herein. Genetic
locus can readily be determined based on the genomic information provided in FIG. 3, such as the
genomic sequence mapped to the reference human. The gene encoding the novel kinase protein of the
present invention is located on a genome component that has been mapped to human chromosome 22 (as
indicated in FIG. 3), which is supported by multiple lines of evidence, such as STS and BAC map
data. As used herein, two proteins (or a region of

a kinase peptide by another amino acid of like characteristics. Typically seen as conservative
substitutions are the replacements, one for another, among the aliphatic amino acids Ala, Val, Leu,
and Ile; interchange of the hydroxyl residues Ser and Thr; exchange of the acidic residues Asp and
Glu; substitution between the amide residues Asn and Gln; exchange of the basic residues Lys and
Arg; and replacements among the aromatic residues Phe and Tyr. Guidance concerning which amino acid
changes are likely to be phenotypically silent is found in Bowie et al., Science 247:1306-1310
(1990).

Variant kinase peptides can be fully functional or can lack function in one or more activities,
e.g. ability to bind substrate, ability to phosphorylate substrate, ability to mediate signaling,
etc. Fully functional variants typically contain only conservative variation or variation in
non-critical residues or in non-critical regions. FIG. 2 provides the result of protein analysis
and can be used to identify critical domains/regions. Functional variants can also contain
substitution of similar amino acids that result in no change or an insignificant change in
function. Alternatively, such substitutions may positively or negatively affect function to some
degree.
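
The conservative substitution groups named in the text (aliphatic, hydroxyl, acidic, amide, basic, aromatic) can be sketched as a simple lookup. This is only an illustrative sketch: the names CONSERVATIVE_GROUPS, substitution_group, and is_conservative are assumptions made for this example and do not appear in the specification. The table covers only the residues listed in the text, and a "conservative" result says nothing about whether a given change is actually phenotypically silent.

```python
# Conservative-substitution groups, limited to the residues listed in the text.
# All names below are hypothetical and chosen for illustration only.
CONSERVATIVE_GROUPS = {
    "aliphatic": {"Ala", "Val", "Leu", "Ile"},
    "hydroxyl": {"Ser", "Thr"},
    "acidic": {"Asp", "Glu"},
    "amide": {"Asn", "Gln"},
    "basic": {"Lys", "Arg"},
    "aromatic": {"Phe", "Tyr"},
}


def substitution_group(original, replacement):
    """Return the name of the group containing both residues, or None."""
    for name, members in CONSERVATIVE_GROUPS.items():
        if original in members and replacement in members:
            return name
    return None


def is_conservative(original, replacement):
    """True when two different residues fall within the same listed group."""
    if original == replacement:
        return False  # no substitution took place
    return substitution_group(original, replacement) is not None
```

Under these assumptions, is_conservative("Leu", "Ile") is true because both residues are in the aliphatic group, while is_conservative("Asp", "Lys") is false because an acidic residue is exchanged for a basic one.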
Non-functional variants typically contain one or more non-conservative amino acid substitutions,
deletions, insertions, inversions, or truncation or a substitution, insertion, inversion, or
deletion in a critical residue or critical region.

Amino acids that are essential for function can be identified by methods known in the art, such as
site-directed mutagenesis or alanine-scanning mutagenesis (Cunningham et al., Science 244:1081-1085
(1989)), particularly using the results provided in FIG. 2. The latter procedure introduces single
alanine mutations at every residue in the molecule. The resulting mutant molecules are then tested
for biological activity such as kinase activity or in assays such as an in vitro proliferative
activity. Sites that are critical for binding partner/substrate binding can also be determined by
structural analysis such as crystallization, nuclear magnetic resonance or photoaffinity labeling
(Smith et al., J. Mol. Biol. 224:899-904 (1992); de Vos et al. Science 255:306-312 (1992)).

The present invention further provides fragments of the kinase peptides, in addition to proteins
and peptides that comprise and consist of such fragments, particularly those comprising the
residues identified in FIG. 2. The fragments to which the invention pertains, however, are not to
be construed as encompassing fragments that may be disclosed publicly prior to the present
invention.

As used herein, a fragment comprises at least 8, 10, 12, 14, 16, or more contiguous amino acid
residues from a kinase peptide. Such fragments can be chosen based on the ability to retain one or
more of the biological activities of the kinase peptide or could be chosen for the ability to
perform a function, e.g. bind a substrate or act as an immunogen. Particularly important fragments
are biologically active fragments, peptides that are, for example, about 8 or more amino acids in
length. Such fragments will typically comprise a domain or motif of the kinase peptide, e.g.,
active site, a transmembrane domain or a substrate-binding domain. Further, possible fragments
include, but are not limited to, domain or motif containing fragments, soluble peptide fragments,
and fragments containing immunogenic structures. Predicted domains and functional sites are readily
identifiable by computer programs well known and readily available to those of skill in the art
(e.g., PROSITE analysis). The results of one such analysis are provided in FIG. 2.

Polypeptides often contain amino acids other than the 20 amino acids commonly referred to as the 20
naturally occurring amino acids. Further, many amino acids, including the terminal amino acids, may
be modified by natural processes, such as processing and other post-translational modifications,
or by chemical modification techniques well known in the art. Common modifications that occur
naturally in kinase peptides are described in basic texts, detailed monographs, and the research
literature, and they are well known to those of skill in the art (some of these features are
identified in FIG. 2).

Known modifications include, but are not limited to, acetylation, acylation, ADP-ribosylation,
amidation, covalent attachment of flavin, covalent attachment of a heme moiety, covalent attachment
of a nucleotide or nucleotide derivative, covalent attachment of a lipid or lipid derivative,
covalent attachment of phosphatidylinositol, cross-linking, cyclization, disulfide bond formation,
demethylation, formation of covalent crosslinks, formation of cystine, formation of pyroglutamate,
formylation, gamma carboxylation, glycosylation, GPI anchor formation, hydroxylation, iodination,
methylation, myristoylation, oxidation, proteolytic processing, phosphorylation, prenylation,
racemization, selenoylation, sulfation, transfer-RNA mediated addition of amino acids to proteins
such as arginylation, and ubiquitination.

Such modifications are well known to those of skill in the art and have been described in great
detail in the scientific literature. Several particularly common modifications, glycosylation,
lipid attachment, sulfation, gamma-carboxylation of glutamic acid residues, hydroxylation and
ADP-ribosylation, for instance, are described in most basic texts, such as Proteins - Structure and
Molecular Properties, 2nd Ed., T. E. Creighton, W. H. Freeman and Company, New York (1993). Many
detailed reviews are available on this subject, such as by Wold, F., Posttranslational Covalent
Modification of Proteins, B. C. Johnson, Ed., Academic Press, New York 1-12 (1983); Seifter et al.
(Meth. Enzymol. 182: 626-646 (1990)) and Rattan et al. (Ann. N.Y. Acad. Sci. 663:48-62 (1992)).

Accordingly, the kinase peptides of the present invention also encompass derivatives or analogs in
which a substituted amino acid residue is not one encoded by the genetic code, in which a
substituent group is included, in which the mature kinase peptide is fused with another compound,
such as a compound to increase the half-life of the kinase peptide (for example, polyethylene
glycol), or in which the additional amino acids are fused to the mature kinase peptide, such as a
leader or secretory sequence or a sequence for purification of the mature kinase peptide or a
pro-protein sequence.

Protein/Peptide Uses

The proteins of the present invention can be used in substantial and specific assays related to the
functional information provided in the Figures; to raise antibodies or to elicit another immune
response; as a reagent (including the labeled reagent) in assays designed to quantitatively
determine levels of the protein (or its binding partner or ligand) in biological fluids; and as
markers for tissues in which the corresponding protein is preferentially expressed (either
constitutively or at a particular stage of tissue differentiation or development or in a disease
state). Where the protein binds or potentially binds to another protein or ligand (such as, for
example, in a kinase-effector protein interaction or kinase-ligand interaction), the protein can be
used to identify the binding partner/ligand so as to develop a system to identify inhibitors of the
binding interaction. Any or all of these uses are capable of being developed into reagent grade or
kit format for commercialization as commercial products.

Methods for performing the uses listed above are well known to those skilled in the art. References
disclosing such methods include "Molecular Cloning: A Laboratory Manual", 2d ed., Cold Spring Harbor
Laboratory Press, Sambrook, J., E. F. Fritsch and T. Maniatis eds., 1989, and "Methods in
Enzymology: Guide to Molecular Cloning Techniques", Academic Press, Berger, S. L. and A. R. Kimmel
eds., 1987.

The potential uses of the peptides of the present invention are based primarily on the source of
the protein as well as the class/action of the protein. For example, kinases isolated from humans
and their human/mammalian orthologs serve as targets for identifying agents for use in mammalian
therapeutic applications, e.g. a human drug, particularly in modulating a biological or
pathological response in a cell or tissue that expresses the kinase. Experimental data as provided
in FIG. 1 indicates that the kinase proteins of the present invention are expressed in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
brain, and thyroid gland, as indicated by virtual northern blot analysis. In addition, PCR-based
tissue screening panels indicate expression in fetal brain. A large percentage of pharmaceutical
agents are being developed that modulate the activity of kinase proteins, particularly members of
the serine/threonine kinase subfamily (see Background of the Invention). The structural and
functional information provided in the Background and Figures provide specific and substantial uses
for the molecules of the present invention, particularly in combination with the expression
information provided in FIG. 1. Experimental data as provided in FIG. 1 indicates expression in
humans in teratocarcinoma, ovary, testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland. Such uses can readily be determined using the information provided herein, that
which is known in the art, and routine experimentation.

The proteins of the present invention (including variants and fragments that may have been
disclosed prior to the present invention) are useful for biological assays related to kinases that
are related to members of the serine/threonine kinase subfamily. Such assays involve any of the
known kinase functions or activities or properties useful for diagnosis and treatment of
kinase-related conditions that are specific for the subfamily of kinases that the one of the
present invention belongs to, particularly in cells and tissues that express the kinase.
Experimental data as provided in FIG. 1 indicates that the kinase proteins of the present invention
are expressed in humans in teratocarcinoma, ovary, testis, nervous tissue, bladder, infant brain,
and thyroid gland, as indicated by virtual northern blot analysis. In addition, PCR-based tissue
screening panels indicate expression in fetal brain.

The proteins of the present invention are also useful in drug screening assays, in cell-based or
cell-free systems. Cell-based systems can be native, i.e., cells that normally express the kinase,
as a biopsy or expanded in cell culture. Experimental data as provided in FIG. 1 indicates
expression in humans in teratocarcinoma, ovary, testis, nervous tissue, bladder, infant and fetal
brain, and thyroid gland. In an alternate embodiment, cell-based assays involve recombinant host
cells expressing the kinase protein.

The polypeptides can be used to identify compounds that modulate kinase activity of the protein in
its natural state or an altered form that causes a specific disease or pathology associated with
the kinase. Both the kinases of the present invention and appropriate variants and fragments can be
used in high-throughput screens to assay candidate compounds for the ability to bind to the kinase.
These compounds can be further screened against a functional kinase to determine the effect of the
compound on the kinase activity. Further, these compounds can be tested in animal or invertebrate
systems to determine activity/effectiveness. Compounds can be identified that activate (agonist) or
inactivate (antagonist) the kinase to a desired degree.

Further, the proteins of the present invention can be used to screen a compound for the ability to
stimulate or inhibit interaction between the kinase protein and a molecule that normally interacts
with the kinase protein, e.g. a substrate or a component of the signal pathway that the kinase
protein normally interacts with (for example, another kinase). Such assays typically include the
steps of combining the kinase protein with a candidate compound under conditions that allow the
kinase protein, or fragment, to interact with the target molecule, and to detect the formation of a
complex between the protein and the target or to detect the biochemical consequence of the
interaction with the kinase protein and the target, such as any of the associated effects of signal
transduction such as protein phosphorylation, cAMP turnover, and adenylate cyclase activation, etc.

Candidate compounds include, for example, 1) peptides such as soluble peptides, including Ig-tailed
fusion peptides and members of random peptide libraries (see, e.g., Lam et al., Nature 354:82-84
(1991); Houghten et al., Nature 354:84-86 (1991)) and combinatorial chemistry-derived molecular
libraries made of D- and/or L-configuration amino acids; 2) phosphopeptides (e.g., members of
random and partially degenerate, directed phosphopeptide libraries, see, e.g., Songyang et al.,
Cell 72:767-778 (1993)); 3) antibodies (e.g., polyclonal, monoclonal, humanized, anti-idiotype,
chimeric, and single chain antibodies as well as Fab, F(ab')2, Fab expression library fragments,
and epitope-binding fragments of antibodies); and 4) small organic and inorganic molecules (e.g.,
molecules obtained from combinatorial and natural product libraries).

One candidate compound is a soluble fragment of the receptor that competes for substrate binding.
Other candidate compounds include mutant kinases or appropriate fragments containing mutations that
affect kinase function and thus compete for substrate. Accordingly, a fragment that competes for
substrate, for example with a higher affinity, or a fragment that binds substrate but does not
allow release, is encompassed by the invention.

The invention further includes other end point assays to identify compounds that modulate
(stimulate or inhibit) kinase activity. The assays typically involve an assay of events in the
signal transduction pathway that indicate kinase activity. Thus, the phosphorylation of a
substrate, activation of a protein, a change in the expression of genes that are up- or
down-regulated in response to the kinase protein dependent signal cascade can be assayed.

Any of the biological or biochemical functions mediated by the kinase can be used as an endpoint
assay. These include all of the biochemical or biochemical/biological events described herein, in
the references cited herein, incorporated by reference for these endpoint assay targets, and other
functions known to those of ordinary skill in the art or that can be readily identified using the
information provided in the Figures, particularly FIG. 2. Specifically, a biological function of a
cell or tissues that expresses the kinase can be assayed. Experimental data as provided in FIG. 1
indicates that the kinase proteins of the present invention are expressed in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant brain, and thyroid gland, as
indicated by virtual northern blot analysis. In addition, PCR-based tissue screening panels
indicate expression in fetal brain.

Binding and/or activating compounds can also be screened by using chimeric kinase proteins in which
the amino terminal extracellular domain, or parts thereof, the entire transmembrane domain or
subregions, such as any of the seven transmembrane segments or any of the intracellular or
extracellular loops and the carboxy terminal intracellular domain, or parts thereof, can be
replaced by heterologous domains or subregions. For example, a substrate-binding region can be used
that interacts with a different substrate than that which is recognized by the native kinase.
Accordingly, a different set of signal transduction components is available as an end-point assay
for activation. This allows for assays to be performed in other than the specific host cell from
which the kinase is derived.

The proteins of the present invention are also useful in competition binding assays in methods
designed to discover compounds that interact with the kinase (e.g. binding partners and/or
ligands). Thus, a compound is exposed to a kinase polypeptide under conditions that allow the
compound to bind or to otherwise interact with the polypeptide. Soluble kinase polypeptide is also
added to the mixture. If the test compound interacts with the soluble kinase polypeptide, it
decreases the amount of complex formed or activity from the kinase target. This type of assay is
particularly useful in cases in which compounds are sought that interact with specific regions of
the kinase. Thus, the soluble polypeptide that competes with the target kinase region is designed
to contain peptide sequences corresponding to the region of interest.

To perform cell free drug screening assays, it is sometimes desirable to immobilize either the
kinase protein, or fragment, or its target molecule to facilitate separation of complexes from
uncomplexed forms of one or both of the proteins, as well as to accommodate automation of the
assay.

Techniques for immobilizing proteins on matrices can be used in the drug screening assays. In one
embodiment, a fusion protein can be provided which adds a domain that allows the protein to be
bound to a matrix. For example, glutathione-S-transferase fusion proteins can be adsorbed onto
glutathione sepharose beads (Sigma Chemical, St. Louis, Mo.) or glutathione derivatized microtitre
plates, which are then combined with the cell lysates (e.g., 35S-labeled) and the candidate
compound, and the mixture incubated under conditions conducive to complex formation (e.g., at
physiological conditions for salt and pH). Following incubation, the beads are washed to remove any
unbound label, and the matrix immobilized and radiolabel determined directly, or in the supernatant
after the complexes are dissociated. Alternatively, the complexes can be dissociated from the
matrix, separated by SDS-PAGE, and the level of kinase-binding protein found in the bead fraction
quantitated from the gel using standard electrophoretic techniques. For example, either the
polypeptide or its target molecule can be immobilized utilizing conjugation of biotin and
streptavidin using techniques well known in the art. Alternatively, antibodies reactive with the
protein but which do not interfere with binding of the protein to its target molecule can be
derivatized to the wells of the plate, and the protein trapped in the wells by antibody
conjugation. Preparations of a kinase-binding protein and a candidate compound are incubated in the
kinase protein-presenting wells and the amount of complex trapped in the well can be quantitated.
Methods for detecting such complexes, in addition to those described above for the GST-immobilized
complexes, include immunodetection of complexes using antibodies reactive with the kinase protein
target molecule, or which are reactive with kinase protein and compete with the target molecule, as
well as enzyme-linked assays which rely on detecting an enzymatic activity associated with the
target molecule.

Agents that modulate one of the kinases of the present invention can be identified using one or
more of the above assays, alone or in combination. It is generally preferable to use a cell-based
or cell free system first and then confirm activity in an animal or other model system. Such model
systems are well known in the art and can readily be employed in this context.

Modulators of kinase protein activity identified according to these drug screening assays can be
used to treat a subject with a disorder mediated by the kinase pathway, by treating cells or
tissues that express the kinase. Experimental data as provided in FIG. 1 indicates expression in
humans in teratocarcinoma, ovary, testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland. These methods of treatment include the steps of administering a modulator of kinase
activity in a pharmaceutical composition to a subject in need of such treatment, the modulator
being identified as described herein.

In yet another aspect of the invention, the kinase proteins can be used as "bait proteins" in a
two-hybrid assay or three-hybrid assay (see, e.g., U.S. Pat. No. 5,283,317; Zervos et al. (1993)
Cell 72:223-232; Madura et al. (1993) J. Biol. Chem. 268:12046-12054; Bartel et al. (1993)
Biotechniques 14:920-924; Iwabuchi et al. (1993) Oncogene 8:1693-1696; and Brent WO94/10300), to
identify other proteins, which bind to or interact with the kinase and are involved in kinase
activity. Such kinase-binding proteins are also likely to be involved in the propagation of signals
by the kinase proteins or kinase targets as, for example, downstream elements of a kinase-mediated
signaling pathway. Alternatively, such kinase-binding proteins are likely to be kinase inhibitors.

The two-hybrid system is based on the modular nature of most transcription factors, which consist
of separable DNA-binding and activation domains. Briefly, the assay utilizes two different DNA
constructs. In one construct, the gene that codes for a kinase protein is fused to a gene encoding
the DNA binding domain of a known transcription factor (e.g., GAL-4). In the other construct, a DNA
sequence, from a library of DNA sequences, that encodes an unidentified protein ("prey" or
"sample") is fused to a gene that codes for the activation domain of the known transcription
factor. If the "bait" and the "prey" proteins are able to interact, in vivo, forming a
kinase-dependent complex, the DNA-binding and activation domains of the transcription factor are
brought into close proximity. This proximity allows transcription of a reporter gene (e.g., LacZ)
which is operably linked to a transcriptional regulatory site responsive to the transcription
factor. Expression of the reporter gene can be detected and cell colonies containing the functional
transcription factor can be isolated and used to obtain the cloned gene which encodes the protein
which interacts with the kinase protein.

This invention further pertains to novel agents identified by the above-described screening assays.
Accordingly, it is within the scope of this invention to further use an agent identified as
described herein in an appropriate animal model. For example, an agent identified as described
herein (e.g., a kinase-modulating agent, an antisense kinase nucleic acid molecule, a
kinase-specific antibody, or a kinase-binding partner) can be used in an animal or other model to
determine the efficacy, toxicity, or side effects of treatment with such an agent. Alternatively,
an agent identified as described herein can be used in an animal or other model to determine the
mechanism of action of such an agent. Furthermore, this invention pertains to uses of novel agents
identified by the above-described screening assays for treatments as described herein.

The kinase proteins of the present invention are also useful to provide a target for diagnosing a
disease or predisposition to disease mediated by the peptide. Accordingly, the invention provides
methods for detecting the presence, or levels of, the protein (or encoding mRNA) in a cell, tissue,
or organism. Experimental data as provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant and fetal brain, and thyroid gland.
The method involves contacting a biological sample with a compound capable of interacting with the
kinase protein such that the interaction can be detected. Such an assay can be provided in a single
detection format or a multi-detection format such as an antibody chip array.

One agent for detecting a protein in a sample is an antibody capable of selectively binding to
protein. A biological sample includes tissues, cells and biological fluids isolated from a subject,
as well as tissues, cells and fluids present within a subject.

The peptides of the present invention also provide targets for diagnosing active protein activity,
disease, or predisposition to disease, in a patient having a variant peptide, particularly
activities and conditions that are known for other members of the family of proteins to which the
present one belongs. Thus, the peptide can be isolated from a biological sample and assayed for the
presence of a genetic mutation that results in aberrant peptide. This includes amino acid
substitution, deletion, insertion, rearrangement (as the result of aberrant splicing events), and
inappropriate post-translational modification. Analytic methods include altered electrophoretic
mobility, altered tryptic peptide digest, altered kinase activity in cell-based or cell-free assay,
alteration in substrate or antibody-binding pattern, altered isoelectric point, direct amino acid
sequencing, and any other of the known assay techniques useful for detecting mutations in a
protein. Such an assay can be provided in a single detection format or a multi-detection format
such as an antibody chip array.

In vitro techniques for detection of peptide include enzyme linked immunosorbent assays (ELISAs),
Western blots, immunoprecipitations and immunofluorescence using a detection reagent, such as an
antibody or protein binding agent. Alternatively, the peptide can be detected in vivo in a subject
by introducing into the subject a labeled anti-peptide antibody or other types of detection agent.
For example, the antibody can be labeled with a radioactive marker whose presence and location in a
subject can be detected by standard imaging techniques. Particularly useful are methods that detect
the allelic variant of a peptide expressed in a subject and methods which detect fragments of a
peptide in a sample.

The peptides are also useful in pharmacogenomic analysis. Pharmacogenomics deals with clinically
significant hereditary variations in the response to drugs due to altered drug disposition and
abnormal action in affected persons. See, e.g., Eichelbaum, M. (Clin. Exp. Pharmacol. Physiol.
23(10-11):983-985 (1996)), and Linder, M. W. (Clin. Chem. 43(2):254-266 (1997)). The clinical
outcomes of these variations result in severe toxicity of therapeutic drugs in certain individuals
or therapeutic failure of drugs in certain individuals as a result of individual variation in
metabolism. Thus, the genotype of the individual can determine the way a therapeutic compound acts
on the body or the way the body metabolizes the compound. Further, the activity of drug
metabolizing enzymes affects both the intensity and duration of drug action. Thus, the
pharmacogenomics of the individual permit the selection of effective compounds and effective
dosages of such compounds for prophylactic or therapeutic treatment based on the individual's
genotype. The discovery of genetic polymorphisms in some drug metabolizing enzymes has explained
why some patients do not obtain the expected drug effects, show an exaggerated drug effect, or
experience serious toxicity from standard drug dosages. Polymorphisms can be expressed in the
phenotype of the extensive metabolizer and the phenotype of the poor metabolizer. Accordingly,
genetic polymorphism may lead to allelic protein variants of the kinase protein in which one or
more of the kinase functions in one population is different from those in another population. The
peptides thus allow a target to ascertain a genetic predisposition that can affect treatment
modality. Thus, in a ligand-based treatment, polymorphism may give rise to amino terminal
extracellular domains and/or other substrate-binding regions that are more or less active in
substrate binding, and kinase activation. Accordingly, substrate dosage would necessarily be
modified to maximize the therapeutic effect within a given population containing a polymorphism. As
an alternative to genotyping, specific polymorphic peptides could be identified.

The peptides are also useful for treating a disorder characterized by an absence of, inappropriate,
or unwanted expression of the protein. Experimental data as provided in FIG. 1 indicates expression
in humans in teratocarcinoma, ovary, testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland. Accordingly, methods for treatment include the use of the kinase protein or
fragments.

Antibodies

The invention also provides antibodies that selectively bind to one of the peptides of the present
invention, a protein comprising such a peptide, as well as variants and fragments thereof. As used
herein, an antibody selectively binds a target peptide when it binds the target peptide and does
not significantly bind to unrelated proteins. An antibody is still considered to selectively bind a
peptide even if it also binds to other proteins that are not substantially homologous with the
target peptide so long as such proteins share homology with a fragment or domain of the peptide
target of the antibody. In this case, it would be understood that antibody binding to the peptide
is still selective despite some degree of cross-reactivity.

As used herein, an antibody is defined in terms consistent with that recognized within the art:
they are multi-subunit proteins produced by a mammalian organism in response to an antigen
challenge. The antibodies of the present invention include polyclonal antibodies and monoclonal
antibodies, as well as fragments of such antibodies, including, but not limited to, Fab or
F(ab')2, and Fv fragments.

Many methods are known for generating and/or identifying antibodies to a given target peptide.
Several such methods are described by Harlow, Antibodies, Cold Spring Harbor Press, (1989).

In general, to generate antibodies, an isolated peptide is used as an immunogen and is administered
to a mammalian organism, such as a rat, rabbit or mouse. The full-length protein, an antigenic
peptide fragment or a fusion protein can be used. Particularly important fragments are those
covering functional domains, such as the domains identified in FIG. 2, and domain of sequence
homology or divergence amongst the family, such as those that can readily be identified using
protein alignment methods and as presented in the Figures.

Antibodies are preferably prepared from regions or discrete fragments of the kinase proteins.
Antibodies can be prepared from any region of the peptide as described herein. However, preferred
regions will include those involved in function/activity and/or kinase/binding partner interaction.
FIG. 2 can be used to identify particularly important regions while sequence alignment can be used
to identify conserved and unique sequence fragments.

An antigenic fragment will typically comprise at least 8 contiguous amino acid residues. The
antigenic peptide can comprise, however, at least 10, 12, 14, 16 or more amino acid residues. Such
fragments can be selected on a physical property, such as fragments correspond to regions that are
located on the surface of the protein, e.g., hydrophilic regions or can be selected based on
sequence uniqueness (see FIG. 2).

Detection on an antibody of the present invention can be facilitated by coupling (i.e., physically
linking) the antibody
to a detectable substance. Examples of detectable substances
include various enzymes, prosthetic groups, fluorescent
materials, luminescent materials, bioluminescent materials,
and radioactive materials. Examples of suitable enzymes
include horseradish peroxidase, alkaline phosphatase,
β-galactosidase, or acetylcholinesterase; examples of suitable
prosthetic group complexes include streptavidin/biotin
and avidin/biotin; examples of suitable fluorescent materials
include umbelliferone, fluorescein, fluorescein
isothiocyanate, rhodamine, dichlorotriazinylamine
fluorescein, dansyl chloride or phycoerythrin; an example of
a luminescent material includes luminol; examples of
bioluminescent materials include luciferase, luciferin, and
aequorin, and examples of suitable radioactive material
include ¹²⁵I, ¹³¹I, ³⁵S or ³H.

Antibody Uses

The antibodies can be used to isolate one of the proteins
of the present invention by standard techniques, such as
affinity chromatography or immunoprecipitation. The
antibodies can facilitate the purification of the natural protein
from cells and recombinantly produced protein expressed in
host cells. In addition, such antibodies are useful to detect
the presence of one of the proteins of the present invention
in cells or tissues to determine the pattern of expression of
the protein among various tissues in an organism and over
the course of normal development. Experimental data as
provided in FIG. 1 indicates that the kinase proteins of the
present invention are expressed in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
brain, and thyroid gland, as indicated by virtual northern blot
analysis. In addition, PCR-based tissue screening panels
indicate expression in fetal brain. Further, such antibodies
can be used to detect protein in situ, in vitro, or in a cell
lysate or supernatant in order to evaluate the abundance and
pattern of expression. Also, such antibodies can be used to
assess abnormal tissue distribution or abnormal expression
during development or progression of a biological condition.
Antibody detection of circulating fragments of the full
length protein can be used to identify turnover.

Further, the antibodies can be used to assess expression in
disease states such as in active stages of the disease or in an
individual with a predisposition toward disease related to the
protein's function. When a disorder is caused by an
inappropriate tissue distribution, developmental expression,
level of expression of the protein, or expressed/processed
form, the antibody can be prepared against the normal
protein. Experimental data as provided in FIG. 1 indicates
expression in humans in teratocarcinoma, ovary, testis,
nervous tissue, bladder, infant and fetal brain, and thyroid
gland. If a disorder is characterized by a specific mutation in
the protein, antibodies specific for this mutant protein can be
used to assay for the presence of the specific mutant protein.

The antibodies can also be used to assess normal and
aberrant subcellular localization of cells in the various
tissues in an organism. Experimental data as provided in
FIG. 1 indicates expression in humans in teratocarcinoma,
ovary, testis, nervous tissue, bladder, infant and fetal brain,
and thyroid gland. The diagnostic uses can be applied, not
only in genetic testing, but also in monitoring a treatment
modality. Accordingly, where treatment is ultimately aimed
at correcting expression level or the presence of aberrant
sequence and aberrant tissue distribution or developmental
expression, antibodies directed against the protein or
relevant fragments can be used to monitor therapeutic efficacy.

Additionally, antibodies are useful in pharmacogenomic
analysis. Thus, antibodies prepared against polymorphic
proteins can be used to identify individuals that require
modified treatment modalities. The antibodies are also
useful as diagnostic tools as an immunological marker for
aberrant protein analyzed by electrophoretic mobility,
isoelectric point, tryptic peptide digest, and other physical
assays known to those in the art.

The antibodies are also useful for tissue typing.
Experimental data as provided in FIG. 1 indicates expression in
humans in teratocarcinoma, ovary, testis, nervous tissue,
bladder, infant and fetal brain, and thyroid gland. Thus,
where a specific protein has been correlated with expression
in a specific tissue, antibodies that are specific for this
protein can be used to identify a tissue type.

The antibodies are also useful for inhibiting protein
function, for example, blocking the binding of the kinase
peptide to a binding partner such as a substrate. These uses
can also be applied in a therapeutic context in which
treatment involves inhibiting the protein's function. An
antibody can be used, for example, to block binding, thus
modulating (agonizing or antagonizing) the peptide's
activity. Antibodies can be prepared against specific fragments
containing sites required for function or against intact
protein that is associated with a cell or cell membrane. See FIG.
2 for structural information relating to the proteins of the
present invention.

The invention also encompasses kits for using antibodies
to detect the presence of a protein in a biological sample.
The kit can comprise antibodies such as a labeled or
labelable antibody and a compound or agent for detecting protein
in a biological sample; means for determining the amount of
protein in the sample; means for comparing the amount of
protein in the sample with a standard; and instructions for
use. Such a kit can be supplied to detect a single protein or
epitope or can be configured to detect one of a multitude of
epitopes, such as in an antibody detection array. Arrays are
described in detail below for nucleic acid arrays and similar
methods have been developed for antibody arrays.

Nucleic Acid Molecules

The present invention further provides isolated nucleic
acid molecules that encode a kinase peptide or protein of the
present invention (cDNA, transcript and genomic sequence).
Such nucleic acid molecules will consist of, consist
essentially of, or comprise a nucleotide sequence that encodes one
of the kinase peptides of the present invention, an allelic
variant thereof, or an ortholog or paralog thereof.

As used herein, an "isolated" nucleic acid molecule is one
that is separated from other nucleic acid present in the
natural source of the nucleic acid. Preferably, an "isolated"
nucleic acid is free of sequences which naturally flank the
nucleic acid (i.e., sequences located at the 5' and 3' ends of
the nucleic acid) in the genomic DNA of the organism from
which the nucleic acid is derived. However, there can be
some flanking nucleotide sequences, for example up to
about 5KB, 4KB, 3KB, 2KB, or 1KB or less, particularly
contiguous peptide encoding sequences and peptide
encoding sequences within the same gene but separated by introns
in the genomic sequence. The important point is that the
nucleic acid is isolated from remote and unimportant
flanking sequences such that it can be subjected to the specific
manipulations described herein such as recombinant
expression, preparation of probes and primers, and other
uses specific to the nucleic acid sequences.

Moreover, an "isolated" nucleic acid molecule, such as a 
transcript/cDNA molecule, can be substantially free of other 
cellular material, or culture medium when produced by 
recombinant techniques, or chemical precursors or other
chemicals when chemically synthesized. However, the
nucleic acid molecule can be fused to other coding or
regulatory sequences and still be considered isolated.

For example, recombinant DNA molecules contained in a
vector are considered isolated. Further examples of isolated
DNA molecules include recombinant DNA molecules
maintained in heterologous host cells or purified (partially or
substantially) DNA molecules in solution. Isolated RNA
molecules include in vivo or in vitro RNA transcripts of the
isolated DNA molecules of the present invention. Isolated
nucleic acid molecules according to the present invention
further include such molecules produced synthetically.

Accordingly, the present invention provides nucleic acid
molecules that consist of the nucleotide sequence shown in
FIG. 1 or 3 (SEQ ID NO:1, transcript sequence and SEQ ID
NO:3, genomic sequence), or any nucleic acid molecule that
encodes the protein provided in FIG. 2, SEQ ID NO:2. A
nucleic acid molecule consists of a nucleotide sequence
when the nucleotide sequence is the complete nucleotide
sequence of the nucleic acid molecule.

The present invention further provides nucleic acid
molecules that consist essentially of the nucleotide sequence
shown in FIG. 1 or 3 (SEQ ID NO:1, transcript sequence and
SEQ ID NO:3, genomic sequence), or any nucleic acid
molecule that encodes the protein provided in FIG. 2, SEQ
ID NO:2. A nucleic acid molecule consists essentially of a
nucleotide sequence when such a nucleotide sequence is
present with only a few additional nucleic acid residues in
the final nucleic acid molecule.

The present invention further provides nucleic acid
molecules that comprise the nucleotide sequences shown in FIG.
1 or 3 (SEQ ID NO:1, transcript sequence and SEQ ID
NO:3, genomic sequence), or any nucleic acid molecule that
encodes the protein provided in FIG. 2, SEQ ID NO:2. A
nucleic acid molecule comprises a nucleotide sequence
when the nucleotide sequence is at least part of the final
nucleotide sequence of the nucleic acid molecule. In such a
fashion, the nucleic acid molecule can be only the nucleotide
sequence or have additional nucleic acid residues, such as
nucleic acid residues that are naturally associated with it or
heterologous nucleotide sequences. Such a nucleic acid
molecule can have a few additional nucleotides or can
comprise several hundred or more additional nucleotides. A
brief description of how various types of these nucleic acid
molecules can be readily made/isolated is provided below.

In FIGS. 1 and 3, both coding and non-coding sequences
are provided. Because of the source of the present invention,
human genomic sequence (FIG. 3) and cDNA/transcript
sequences (FIG. 1), the nucleic acid molecules in the Figures
will contain genomic intronic sequences, 5' and 3'
non-coding sequences, gene regulatory regions and non-coding
intergenic sequences. In general such sequence features are
either noted in FIGS. 1 and 3 or can readily be identified
using computational tools known in the art. As discussed
below, some of the non-coding regions, particularly gene
regulatory elements such as promoters, are useful for a
variety of purposes, e.g. control of heterologous gene
expression, target for identifying gene activity modulating
compounds, and are particularly claimed as fragments of the
genomic sequence provided herein.

The isolated nucleic acid molecules can encode the
mature protein plus additional amino or carboxyl-terminal
amino acids, or amino acids interior to the mature peptide
(when the mature form has more than one peptide chain, for
instance). Such sequences may play a role in processing of
a protein from precursor to a mature form, facilitate protein
trafficking, prolong or shorten protein half-life or facilitate
manipulation of a protein for assay or production, among
other things. As generally is the case in situ, the additional
amino acids may be processed away from the mature protein
by cellular enzymes.

As mentioned above, the isolated nucleic acid molecules
include, but are not limited to, the sequence encoding the
peptide alone, the sequence encoding the mature
peptide and additional coding sequences, such as a leader or
secretory sequence (e.g., a pre-pro or pro-protein sequence),
the sequence encoding the mature peptide, with or without
the additional coding sequences, plus additional non-coding
sequences, for example introns and non-coding 5' and 3'
sequences such as transcribed but non-translated sequences
that play a role in transcription, mRNA processing
(including splicing and polyadenylation signals), ribosome
binding and stability of mRNA. In addition, the nucleic acid
molecule may be fused to a marker sequence encoding, for
example, a peptide that facilitates purification.

Isolated nucleic acid molecules can be in the form of
RNA, such as mRNA, or in the form DNA, including cDNA
and genomic DNA obtained by cloning or produced by
chemical synthetic techniques or by a combination thereof.
The nucleic acid, especially DNA, can be double-stranded or
single-stranded. Single-stranded nucleic acid can be the
coding strand (sense strand) or the non-coding strand
(anti-sense strand).

The invention further provides nucleic acid molecules that
encode fragments of the peptides of the present invention as
well as nucleic acid molecules that encode obvious variants
of the kinase proteins of the present invention that are
described above. Such nucleic acid molecules may be
naturally occurring, such as allelic variants (same locus),
paralogs (different locus), and orthologs (different organism), or
may be constructed by recombinant DNA methods or by
chemical synthesis. Such non-naturally occurring variants
may be made by mutagenesis techniques, including those
applied to nucleic acid molecules, cells, or organisms.
Accordingly, as discussed above, the variants can contain
nucleotide substitutions, deletions, inversions and
insertions. Variation can occur in either or both the coding and
non-coding regions. The variations can produce both
conservative and non-conservative amino acid substitutions.

The present invention further provides non-coding
fragments of the nucleic acid molecules provided in FIGS. 1 and
3. Preferred non-coding fragments include, but are not
limited to, promoter sequences, enhancer sequences, gene
modulating sequences and gene termination sequences.
Such fragments are useful in controlling heterologous gene
expression and in developing screens to identify
gene-modulating agents. A promoter can readily be identified as
being 5' to the ATG start site in the genomic sequence
provided in FIG. 3.

A fragment comprises a contiguous nucleotide sequence
greater than 12 or more nucleotides. Further, a fragment
could be at least 30, 40, 50, 100, 250 or 500 nucleotides in
length. The length of the fragment will be based on its
intended use. For example, the fragment can encode epitope
bearing regions of the peptide, or can be useful as DNA
probes and primers. Such fragments can be isolated using
the known nucleotide sequence to synthesize an
oligonucleotide probe. A labeled probe can then be used to screen a
cDNA library, genomic DNA library, or mRNA to isolate
nucleic acid corresponding to the coding region. Further,
primers can be used in PCR reactions to clone specific
regions of a gene.

A probe/primer typically comprises a substantially purified
oligonucleotide or oligonucleotide pair. The oligonucleotide
typically comprises a region of nucleotide sequence
that hybridizes under stringent conditions to at least about
12, 20, 25, 40, 50 or more consecutive nucleotides.
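
The "consecutive nucleotides" criterion can be illustrated with a
minimal sketch, assuming exact string matching as a crude stand-in
for hybridization (real hybridization involves the complementary
strand and tolerates mismatches). The function name and the toy
sequences are hypothetical and are not taken from the Figures.

```python
def shares_consecutive_run(probe: str, target: str, min_run: int = 12) -> bool:
    """Return True if any window of `min_run` consecutive probe
    nucleotides occurs verbatim in `target`.

    Simplification: exact matching only; strand complementarity and
    mismatch tolerance are deliberately ignored.
    """
    if min_run <= 0:
        raise ValueError("min_run must be positive")
    probe, target = probe.upper(), target.upper()
    # Slide a window of length min_run along the probe.
    for start in range(len(probe) - min_run + 1):
        if probe[start:start + min_run] in target:
            return True
    return False


# Toy sequences for illustration only.
target = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
probe = target[5:20] + "TTTT"  # 15 shared bases plus an unrelated tail
print(shares_consecutive_run(probe, target, min_run=12))
```

Raising `min_run` to 20, 25, 40 or 50 mirrors the longer thresholds
listed in the text.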

Orthologs, homologs, and allelic variants can be identified 
using methods well known in the art. As described in the 
Peptide Section, these variants comprise a nucleotide 
sequence encoding a peptide that is typically 60-70%, 
70-80%, 80-90%, and more typically at least about 90-95% 
or more homologous to the nucleotide sequence shown in 
the Figure sheets or a fragment of this sequence. Such 
nucleic acid molecules can readily be identified as being 
able to hybridize under moderate to stringent conditions, to 
the nucleotide sequence shown in the Figure sheets or a 
fragment of the sequence. Allelic variants can readily be 
determined by genetic locus of the encoding gene. The gene 
encoding the novel kinase protein of the present invention is 
located on a genome component that has been mapped to 
human chromosome 22 (as indicated in FIG. 3), which is 
supported by multiple lines of evidence, such as STS and 
BAC map data. 
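
The homology ranges above can be sketched as a simple calculation,
assuming the two sequences have already been aligned to equal length
(with "-" for gaps) and using percent identity as a stand-in for
"homologous". The function names and band labels are hypothetical
conveniences; the text does not prescribe a particular algorithm.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns holding identical residues.

    Columns where both sequences have a gap are skipped; a gap opposite
    a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def homology_band(pct: float) -> str:
    """Map a percentage onto the ranges named in the text."""
    if pct >= 90:
        return "at least about 90%"
    if pct >= 80:
        return "80-90%"
    if pct >= 70:
        return "70-80%"
    if pct >= 60:
        return "60-70%"
    return "below 60%"


pct = percent_identity("ACGTACGTAC", "ACGTACGTTC")  # one mismatch in ten
print(pct, homology_band(pct))
```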

FIG. 3 provides information on SNPs that have been 
found in the gene encoding the kinase protein of the present 
invention. SNPs were identified at 42 different nucleotide 
positions. Some of these SNPs, which are located outside the 
ORF and in introns, may affect gene transcription. 

As used herein, the term "hybridizes under stringent
conditions" is intended to describe conditions for
hybridization and washing under which nucleotide sequences
encoding a peptide at least 60-70% homologous to each
other typically remain hybridized to each other. The
conditions can be such that sequences at least about 60%, at least
about 70%, or at least about 80% or more homologous to
each other typically remain hybridized to each other. Such
stringent conditions are known to those skilled in the art and
can be found in Current Protocols in Molecular Biology,
John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. One example
of stringent hybridization conditions is hybridization in 6x
sodium chloride/sodium citrate (SSC) at about 45° C.,
followed by one or more washes in 0.2x SSC, 0.1% SDS at
50-65° C. Examples of moderate to low stringency
hybridization conditions are well known in the art.
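
The difference in ionic strength between the hybridization and wash
steps can be illustrated with a short calculation. This sketch
assumes the commonly used recipe in which 1x SSC is 150 mM sodium
chloride and 15 mM trisodium citrate; that composition is an
assumption not stated in this document, and the names below are
hypothetical.

```python
# Assumed composition of 1x SSC (not stated in this document).
NACL_MM_PER_1X = 150.0      # mM sodium chloride
CITRATE_MM_PER_1X = 15.0    # mM trisodium citrate (3 Na+ per citrate)


def ssc_sodium_mM(fold: float) -> float:
    """Approximate total Na+ concentration (mM) of an SSC buffer at
    the given fold concentration, e.g. 6 for 6x SSC."""
    if fold < 0:
        raise ValueError("fold concentration cannot be negative")
    return fold * (NACL_MM_PER_1X + 3 * CITRATE_MM_PER_1X)


hybridization = ssc_sodium_mM(6)    # 6x SSC hybridization step
wash = ssc_sodium_mM(0.2)           # 0.2x SSC wash step
print(f"hybridization: {hybridization:.0f} mM Na+, wash: {wash:.0f} mM Na+")
```

Under that assumption the wash buffer carries roughly one thirtieth of
the sodium of the hybridization buffer, which is why the wash is the
more stringent step.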
Nucleic Acid Molecule Uses

The nucleic acid molecules of the present invention are 
useful for probes, primers, chemical intermediates, and in 
biological assays. The nucleic acid molecules are useful as 
a hybridization probe for messenger RNA, transcript/cDNA 
and genomic DNA to isolate full-length cDNA and genomic 
clones encoding the peptide described in FIG. 2 and to 
isolate cDNA and genomic clones that correspond to variants
(alleles, orthologs, etc.) producing the same or related
peptides shown in FIG. 2. As illustrated in FIG. 3, SNPs 
were identified at 42 different nucleotide positions. 

The probe can correspond to any sequence along the 
entire length of the nucleic acid molecules provided in the 
Figures. Accordingly, it could be derived from 5' noncoding
regions, the coding region, and 3' noncoding regions. 
However, as discussed, fragments are not to be construed as 
encompassing fragments disclosed prior to the present 
invention. 

The nucleic acid molecules are also useful as primers for 
PCR to amplify any given region of a nucleic acid molecule 
and are useful to synthesize antisense molecules of desired 
length and sequence. 
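
Designing an antisense molecule of a desired length and sequence can
be sketched as taking the reverse complement of a chosen stretch of
the sense strand. This is a minimal illustration under that
assumption; the function names and the toy sequence are hypothetical,
and no sequence from the Figures is used.

```python
# Translation table for DNA base pairing (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def antisense_oligo(sense: str, start: int, length: int) -> str:
    """Return a DNA oligonucleotide antisense to
    sense[start:start + length], written 5' to 3'."""
    if start < 0 or length <= 0 or start + length > len(sense):
        raise ValueError("requested region lies outside the sequence")
    return reverse_complement(sense[start:start + length])


sense_strand = "ATGGCCATTGTAATGGGCCGC"  # toy sense-strand sequence
print(antisense_oligo(sense_strand, start=0, length=12))
```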



The nucleic acid molecules are also useful for construct- 
ing recombinant vectors. Such vectors include expression 
vectors that express a portion of, or all of, the peptide 
sequences. Vectors also include insertion vectors, used to 
integrate into another nucleic acid molecule sequence, such
as into the cellular genome, to alter in situ expression of a
gene and/or gene product. For example, an endogenous
coding sequence can be replaced via homologous
recombination with all or part of the coding region containing one or
more specifically introduced mutations.

The nucleic acid molecules are also useful for expressing 
antigenic portions of the proteins. 

The nucleic acid molecules are also useful as probes for 
determining the chromosomal positions of the nucleic acid 
molecules by means of in situ hybridization methods. The 
gene encoding the novel kinase protein of the present 
invention is located on a genome component that has been 
mapped to human chromosome 22 (as indicated in FIG. 3), 
which is supported by multiple lines of evidence, such as 
STS and BAC map data. 

The nucleic acid molecules are also useful in making 
vectors containing the gene regulatory regions of the nucleic 
acid molecules of the present invention. 

The nucleic acid molecules are also useful for designing
ribozymes corresponding to all, or a part, of the mRNA 
produced from the nucleic acid molecules described herein. 

The nucleic acid molecules are also useful for making 
vectors that express part, or all, of the peptides. 

The nucleic acid molecules are also useful for constructing
host cells expressing a part, or all, of the nucleic acid
molecules and peptides. 

The nucleic acid molecules are also useful for construct- 
ing transgenic animals expressing all, or a part, of the 
nucleic acid molecules and peptides. 

The nucleic acid molecules are also useful as hybridization
probes for determining the presence, level, form and
distribution of nucleic acid expression. Experimental data as 
provided in FIG. 1 indicates that the kinase proteins of the 
present invention are expressed in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
brain, and thyroid gland, as indicated by virtual northern blot 
analysis. In addition, PCR-based tissue screening panels 
indicate expression in fetal brain. Accordingly, the probes 
can be used to detect the presence of, or to determine levels 
of, a specific nucleic acid molecule in cells, tissues, and in 
organisms. The nucleic acid whose level is determined can 
be DNA or RNA. Accordingly, probes corresponding to the 
peptides described herein can be used to assess expression 
and/or gene copy number in a given cell, tissue, or organism. 
These uses are relevant for diagnosis of disorders involving 
an increase or decrease in kinase protein expression relative 
to normal results. 

In vitro techniques for detection of mRNA include Northern
hybridizations and in situ hybridizations. In vitro techniques
for detecting DNA include Southern hybridizations
and in situ hybridization.

Probes can be used as a part of a diagnostic test kit for
identifying cells or tissues that express a kinase protein, such
as by measuring a level of a kinase-encoding nucleic acid in
a sample of cells from a subject, e.g., mRNA or genomic
DNA, or determining if a kinase gene has been mutated.
Experimental data as provided in FIG. 1 indicates that the
kinase proteins of the present invention are expressed in
humans in teratocarcinoma, ovary, testis, nervous tissue,
bladder, infant brain, and thyroid gland, as indicated by
virtual northern blot analysis. In addition, PCR-based tissue
screening panels indicate expression in fetal brain.

Nucleic acid expression assays are useful for drug
screening to identify compounds that modulate kinase nucleic acid
expression.

The invention thus provides a method for identifying a
compound that can be used to treat a disorder associated
with nucleic acid expression of the kinase gene, particularly
biological and pathological processes that are mediated by
the kinase in cells and tissues that express it. Experimental
data as provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
and fetal brain, and thyroid gland. The method typically
includes assaying the ability of the compound to modulate
the expression of the kinase nucleic acid and thus identifying
a compound that can be used to treat a disorder characterized
by undesired kinase nucleic acid expression. The assays can
be performed in cell-based and cell-free systems. Cell-based
assays include cells naturally expressing the kinase nucleic
acid or recombinant cells genetically engineered to express
specific nucleic acid sequences.

The assay for kinase nucleic acid expression can involve
direct assay of nucleic acid levels, such as mRNA levels, or
on collateral compounds involved in the signal pathway.
Further, the expression of genes that are up- or
down-regulated in response to the kinase protein signal pathway
can also be assayed. In this embodiment the regulatory
regions of these genes can be operably linked to a reporter
gene such as luciferase.

Thus, modulators of kinase gene expression can be
identified in a method wherein a cell is contacted with a
candidate compound and the expression of mRNA
determined. The level of expression of kinase mRNA in the
presence of the candidate compound is compared to the level
of expression of kinase mRNA in the absence of the
candidate compound. The candidate compound can then be
identified as a modulator of nucleic acid expression based on this
comparison and be used, for example to treat a disorder
characterized by aberrant nucleic acid expression. When
expression of mRNA is statistically significantly greater in
the presence of the candidate compound than in its absence,
the candidate compound is identified as a stimulator of
nucleic acid expression. When nucleic acid expression is
statistically significantly less in the presence of the candidate
compound than in its absence, the candidate compound is
identified as an inhibitor of nucleic acid expression.

The invention further provides methods of treatment, with
the nucleic acid as a target, using a compound identified
through drug screening as a gene modulator to modulate
kinase nucleic acid expression in cells and tissues that
express the kinase. Experimental data as provided in FIG. 1
indicates that the kinase proteins of the present invention are
expressed in humans in teratocarcinoma, ovary, testis,
nervous tissue, bladder, infant brain, and thyroid gland, as
indicated by virtual northern blot analysis. In addition,
PCR-based tissue screening panels indicate expression in
fetal brain. Modulation includes both up-regulation (i.e.
activation or agonization) or down-regulation (suppression
or antagonization) of nucleic acid expression.

Alternatively, a modulator for kinase nucleic acid
expression can be a small molecule or drug identified using the
screening assays described herein as long as the drug or
small molecule inhibits the kinase nucleic acid expression in
the cells and tissues that express the protein. Experimental
data as provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
and fetal brain, and thyroid gland.

The nucleic acid molecules are also useful for monitoring
the effectiveness of modulating compounds on the
expression or activity of the kinase gene in clinical trials or in a
treatment regimen. Thus, the gene expression pattern can
serve as a barometer for the continuing effectiveness of
treatment with the compound, particularly with compounds
to which a patient can develop resistance. The gene
expression pattern can also serve as a marker indicative of a
physiological response of the affected cells to the compound.
Accordingly, such monitoring would allow either increased
administration of the compound or the administration of
alternative compounds to which the patient has not become
resistant. Similarly, if the level of nucleic acid expression
falls below a desirable level, administration of the
compound could be commensurately decreased.

The nucleic acid molecules are also useful in diagnostic
assays for qualitative changes in kinase nucleic acid
expression, and particularly in qualitative changes that lead
to pathology. The nucleic acid molecules can be used to
detect mutations in kinase genes and gene expression
products such as mRNA. The nucleic acid molecules can be used
as hybridization probes to detect naturally occurring genetic
mutations in the kinase gene and thereby to determine
whether a subject with the mutation is at risk for a disorder
caused by the mutation. Mutations include deletion,
addition, or substitution of one or more nucleotides in the
gene, chromosomal rearrangement, such as inversion or
transposition, modification of genomic DNA, such as
aberrant methylation patterns or changes in gene copy number,
such as amplification. Detection of a mutated form of the
kinase gene associated with a dysfunction provides a
diagnostic tool for an active disease or susceptibility to disease
when the disease results from overexpression,
underexpression, or altered expression of a kinase protein.

Individuals carrying mutations in the kinase gene can be
detected at the nucleic acid level by a variety of techniques.
FIG. 3 provides information on SNPs that have been found
in the gene encoding the kinase protein of the present
invention. SNPs were identified at 42 different nucleotide
positions. Some of these SNPs, which are located outside the
ORF and in introns, may affect gene transcription. The gene
encoding the novel kinase protein of the present invention is
located on a genome component that has been mapped to
human chromosome 22 (as indicated in FIG. 3), which is
supported by multiple lines of evidence, such as STS and
BAC map data. Genomic DNA can be analyzed directly or
can be amplified by using PCR prior to analysis. RNA or
cDNA can be used in the same way. In some uses, detection
of the mutation involves the use of a probe/primer in a
polymerase chain reaction (PCR) (see, e.g. U.S. Pat. Nos.
4,683,195 and 4,683,202), such as anchor PCR or RACE
PCR, or, alternatively, in a ligation chain reaction (LCR)
(see, e.g., Landegran et al., Science 241:1077-1080 (1988);
and Nakazawa et al., PNAS 91:360-364 (1994)), the latter of
which can be particularly useful for detecting point
mutations in the gene (see Abravaya et al., Nucleic Acids Res.
23:675-682 (1995)). This method can include the steps of
collecting a sample of cells from a patient, isolating nucleic
acid (e.g., genomic, mRNA or both) from the cells of the
sample, contacting the nucleic acid sample with one or more
primers which specifically hybridize to a gene under
conditions such that hybridization and amplification of the gene
(if present) occurs, and detecting the presence or absence of
an amplification product, or detecting the size of the
amplification product and comparing the length to a control
sample. Deletions and insertions can be detected by a change
in size of the amplified product compared to the normal
genotype. Point mutations can be identified by hybridizing involved in transcription, preventing transcription and hence

amplified DNA to normal RNA or antisense DNA production of kinase protein. An antisense RNA or DNA 

sequences.

Alternatively, mutations in a kinase gene can be directly
identified, for example, by alterations in restriction enzyme
digestion patterns determined by gel electrophoresis.

Further, sequence-specific ribozymes (U.S. Pat. No.
5,498,531) can be used to score for the presence of specific
mutations by development or loss of a ribozyme cleavage
site. Perfectly matched sequences can be distinguished from
mismatched sequences by nuclease cleavage digestion
assays or by differences in melting temperature.

Sequence changes at specific locations can also be
assessed by nuclease protection assays such as RNase and
S1 protection or the chemical cleavage method.
Furthermore, sequence differences between a mutant kinase
gene and a wild-type gene can be determined by direct DNA
sequencing. A variety of automated sequencing procedures
can be utilized when performing the diagnostic assays
(Naeve, C. W., (1995) Biotechniques 19:448), including
sequencing by mass spectrometry (see, e.g., PCT
International Publication No. WO 94/16101; Cohen et al., Adv.
Chromatogr. 36:127-162 (1996); and Griffin et al., Appl.
Biochem. Biotechnol. 38:147-159 (1993)).

Other methods for detecting mutations in the gene include
methods in which protection from cleavage agents is used to
detect mismatched bases in RNA/RNA or RNA/DNA
duplexes (Myers et al., Science 230:1242 (1985); Cotton et
al., PNAS 85:4397 (1988); Saleeba et al., Meth. Enzymol.
217:286-295 (1992)), electrophoretic mobility of mutant and
wild type nucleic acid is compared (Orita et al., PNAS
86:2766 (1989); Cotton et al., Mutat. Res. 285:125-144
(1993); and Hayashi et al., Genet. Anal. Tech. Appl. 9:73-79
(1992)), and movement of mutant or wild-type fragments in
polyacrylamide gels containing a gradient of denaturant is
assayed using denaturing gradient gel electrophoresis
(Myers et al., Nature 313:495 (1985)). Examples of other
techniques for detecting point mutations include selective
oligonucleotide hybridization, selective amplification, and
selective primer extension.

The nucleic acid molecules are also useful for testing an
individual for a genotype that, while not necessarily causing
the disease, nevertheless affects the treatment modality.
Thus, the nucleic acid molecules can be used to study the
relationship between an individual's genotype and the
individual's response to a compound used for treatment
(pharmacogenomic relationship). Accordingly, the nucleic
acid molecules described herein can be used to assess the
mutation content of the kinase gene in an individual in order
to select an appropriate compound or dosage regimen for
treatment. FIG. 3 provides information on SNPs that have
been found in the gene encoding the kinase protein of the
present invention. SNPs were identified at 42 different
nucleotide positions. Some of these SNPs, which are located
outside the ORF and in introns, may affect gene
transcription.

Thus nucleic acid molecules displaying genetic variations
that affect treatment provide a diagnostic target that can be
used to tailor treatment in an individual. Accordingly, the
production of recombinant cells and animals containing
these polymorphisms allows effective clinical design of
treatment compounds and dosage regimens.

The nucleic acid molecules are thus useful as antisense
constructs to control kinase gene expression in cells, tissues,
and organisms. A DNA antisense nucleic acid molecule is
designed to be complementary to a region of the gene [...]
nucleic acid molecule would hybridize to the mRNA and
thus block translation of mRNA into kinase protein.

Alternatively, a class of antisense molecules can be used
to inactivate mRNA in order to decrease expression of
kinase nucleic acid. Accordingly, these molecules can treat
a disorder characterized by abnormal or undesired kinase
nucleic acid expression. This technique involves cleavage
by means of ribozymes containing nucleotide sequences
complementary to one or more regions in the mRNA that
attenuate the ability of the mRNA to be translated. Possible
regions include coding regions and particularly coding
regions corresponding to the catalytic and other functional
activities of the kinase protein, such as substrate binding.

The nucleic acid molecules also provide vectors for gene
therapy in patients containing cells that are aberrant in
kinase gene expression. Thus, recombinant cells, which
include the patient's cells that have been engineered ex vivo
and returned to the patient, are introduced into an individual
where the cells produce the desired kinase protein to treat the
individual.

The invention also encompasses kits for detecting the
presence of a kinase nucleic acid in a biological sample.
Experimental data as provided in FIG. 1 indicates that the
kinase proteins of the present invention are expressed in
humans in teratocarcinoma, ovary, testis, nervous tissue,
bladder, infant brain, and thyroid gland, as indicated by
virtual northern blot analysis. In addition, PCR-based tissue
screening panels indicate expression in fetal brain. For
example, the kit can comprise reagents such as a labeled or
labelable nucleic acid or agent capable of detecting kinase
nucleic acid in a biological sample; means for determining
the amount of kinase nucleic acid in the sample; and means
for comparing the amount of kinase nucleic acid in the
sample with a standard. The compound or agent can be
packaged in a suitable container. The kit can further
comprise instructions for using the kit to detect kinase protein
mRNA or DNA.

Nucleic Acid Arrays

The present invention further provides nucleic acid
detection kits, such as arrays or microarrays of nucleic acid
molecules that are based on the sequence information
provided in FIGS. 1 and 3 (SEQ ID NOS:1 and 3).

As used herein "Arrays" or "Microarrays" refers to an
array of distinct polynucleotides or oligonucleotides
synthesized on a substrate, such as paper, nylon or other type of
membrane, filter, chip, glass slide, or any other suitable solid
support. In one embodiment, the microarray is prepared and
used according to the methods described in U.S. Pat. No.
5,837,832, Chee et al., PCT application WO95/11995 (Chee
et al.), Lockhart, D. J. et al. (1996; Nat. Biotech. 14:
1675-1680) and Schena, M. et al. (1996; Proc. Natl. Acad.
Sci. 93: 10614-10619), all of which are incorporated herein
in their entirety by reference. In other embodiments, such
arrays are produced by the methods described by Brown et
al., U.S. Pat. No. 5,807,522.

The microarray or detection kit is preferably composed of
a large number of unique, single-stranded nucleic acid
sequences, usually either synthetic antisense
oligonucleotides or fragments of cDNAs, fixed to a solid support. The
oligonucleotides are preferably about 6-60 nucleotides in
length, more preferably 15-30 nucleotides in length, and
most preferably about 20-25 nucleotides in length. For a
certain type of microarray or detection kit, it may be
preferable to use oligonucleotides that are only 7-20
nucleotides in length. The microarray or detection kit may contain
oligonucleotides that cover the known 5' or 3' sequence;
sequential oligonucleotides which cover the full length
sequence; or unique oligonucleotides selected from
particular areas along the length of the sequence. Polynucleotides
used in the microarray or detection kit may be
oligonucleotides that are specific to a gene or genes of interest.
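
The coverage options named above (oligonucleotides covering the
known 5' or 3' sequence, or sequential oligonucleotides covering
the full length sequence) can be illustrated with a minimal
sketch. The function names, the default length of 25 nucleotides,
and the example sequence are assumptions made for illustration;
they are not taken from the specification.

```python
def tile_oligos(sequence, length=25, step=25):
    """Return sequential oligos of `length` nt that cover `sequence`.

    Hypothetical helper: windows start at the 5' end and advance by
    `step`; a final window flush with the 3' end is added if needed.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    seq = sequence.upper()
    oligos = [seq[i:i + length]
              for i in range(0, len(seq) - length + 1, step)]
    # Cover the 3' end when the regular tiling stops short of it.
    if len(seq) >= length and (len(seq) - length) % step != 0:
        oligos.append(seq[-length:])
    return oligos


def terminal_oligos(sequence, length=25):
    """Return the oligos covering the known 5' and 3' ends."""
    seq = sequence.upper()
    return seq[:length], seq[-length:]
```

A smaller `step` than `length` gives overlapping oligonucleotides;
equal values give end-to-end tiling.
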



In order to produce oligonucleotides to a known sequence
for a microarray or detection kit, the gene(s) of interest (or
an ORF identified from the contigs of the present invention)
is typically examined using a computer algorithm which
starts at the 5' or at the 3' end of the nucleotide sequence.
Typical algorithms will then identify oligomers of defined
length that are unique to the gene, have a GC content within
a range suitable for hybridization, and lack predicted
secondary structure that may interfere with hybridization. In
certain situations it may be appropriate to use pairs of
oligonucleotides on a microarray or detection kit. The
"pairs" will be identical, except for one nucleotide that
preferably is located in the center of the sequence. The
second oligonucleotide in the pair (mismatched by one)
serves as a control. The number of oligonucleotide pairs may
range from two to one million. The oligomers are
synthesized at designated areas on a substrate using a light-directed
chemical process. The substrate may be paper, nylon or
other type of membrane, filter, chip, glass slide or any other
suitable solid support.

In another aspect, an oligonucleotide may be synthesized
on the surface of the substrate by using a chemical coupling
procedure and an ink jet application apparatus, as described
in PCT application WO95/251116 (Baldeschweiler et al.)
which is incorporated herein in its entirety by reference. In
another aspect, a "gridded" array analogous to a dot (or slot)
blot may be used to arrange and link cDNA fragments or
oligonucleotides to the surface of a substrate using a vacuum
system, thermal, UV, mechanical or chemical bonding
procedures. An array, such as those described above, may be
produced by hand or by using available devices (slot blot or
dot blot apparatus), materials (any suitable solid support),
and machines (including robotic instruments), and may
contain 8, 24, 96, 384, 1536, 6144 or more oligonucleotides,
or any other number between two and one million which
lends itself to the efficient use of commercially available
instrumentation.

In order to conduct sample analysis using a microarray or
detection kit, the RNA or DNA from a biological sample is
made into hybridization probes. The mRNA is isolated, and
cDNA is produced and used as a template to make antisense
RNA (aRNA). The aRNA is amplified in the presence of
fluorescent nucleotides, and labeled probes are incubated
with the microarray or detection kit so that the probe
sequences hybridize to complementary oligonucleotides of
the microarray or detection kit. Incubation conditions are
adjusted so that hybridization occurs with precise
complementary matches or with various degrees of less
complementarity. After removal of nonhybridized probes, a scanner
is used to determine the levels and patterns of fluorescence.
The scanned images are examined to determine degree of
complementarity and the relative abundance of each
oligonucleotide sequence on the microarray or detection kit. The
biological samples may be obtained from any bodily fluids
(such as blood, urine, saliva, phlegm, gastric juices, etc.),
cultured cells, biopsies, or other tissue preparations. A
detection system may be used to measure the absence,
presence, and amount of hybridization for all of the distinct
sequences simultaneously. This data may be used for
large-scale correlation studies on the sequences, expression
patterns, mutations, variants, or polymorphisms among
samples.

Using such arrays, the present invention provides
methods to identify the expression of the kinase proteins/peptides
of the present invention. In detail, such methods comprise
incubating a test sample with one or more nucleic acid
molecules and assaying for binding of the nucleic acid
molecule with components within the test sample. Such
assays will typically involve arrays comprising many genes,
at least one of which is a gene of the present invention
and/or alleles of the kinase gene of the present invention. FIG.
3 provides information on SNPs that have been found in the
gene encoding the kinase protein of the present invention.
SNPs were identified at 42 different nucleotide positions.
Some of these SNPs, which are located outside the ORF and
in introns, may affect gene transcription.

Conditions for incubating a nucleic acid molecule with a
test sample vary. Incubation conditions depend on the format
employed in the assay, the detection methods employed, and
the type and nature of the nucleic acid molecule used in the
assay. One skilled in the art will recognize that any one of
the commonly available hybridization, amplification or
array assay formats can readily be adapted to employ the
novel fragments of the human genome disclosed herein.
Examples of such assays can be found in Chard, T., An
Introduction to Radioimmunoassay and Related Techniques,
Elsevier Science Publishers, Amsterdam, The Netherlands
(1986); Bullock, G. R. et al., Techniques in
Immunocytochemistry, Academic Press, Orlando, Fla. Vol. 1
(1982), Vol. 2 (1983), Vol. 3 (1985); Tijssen, P., Practice
and Theory of Enzyme Immunoassays: Laboratory
Techniques in Biochemistry and Molecular Biology, Elsevier
Science Publishers, Amsterdam, The Netherlands (1985).

The test samples of the present invention include cells,
protein or membrane extracts of cells. The test sample used
in the above-described method will vary based on the assay
format, nature of the detection method and the tissues, cells
or extracts used as the sample to be assayed. Methods for
preparing nucleic acid extracts of cells are well known in
the art and can readily be adapted in order to obtain a
sample that is compatible with the system utilized.

In another embodiment of the present invention, kits are
provided which contain the necessary reagents to carry out
the assays of the present invention.

Specifically, the invention provides a compartmentalized
kit to receive, in close confinement, one or more containers
which comprises: (a) a first container comprising one of the
nucleic acid molecules that can bind to a fragment of the
human genome disclosed herein; and (b) one or more other
containers comprising one or more of the following: wash
reagents, reagents capable of detecting presence of a bound
nucleic acid.

In detail, a compartmentalized kit includes any kit in
which reagents are contained in separate containers. Such
containers include small glass containers, plastic containers,
strips of plastic, glass or paper, or arraying material such as
silica. Such containers allow one to efficiently transfer
reagents from one compartment to another compartment
such that the samples and reagents are not
cross-contaminated, and the agents or solutions of each container
can be added in a quantitative fashion from one
compartment to another. Such containers will include a container
which will accept the test sample, a container which contains
the nucleic acid probe, containers which contain wash
reagents (such as phosphate buffered saline, Tris-buffers,
etc.), and containers which contain the reagents used to
detect the bound probe. One skilled in the art will readily
recognize that the previously unidentified kinase gene of the
present invention can be routinely identified using the
sequence information disclosed herein and can be readily
incorporated into one of the established kit formats which are well
known in the art, particularly expression arrays.
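
The oligomer selection described earlier in this section
(oligomers of defined length that are unique and have a GC
content within a suitable range, each paired with a control
mismatched at the central nucleotide) can be sketched as
follows. This is only an illustration under stated assumptions:
the 40-60% GC window, the function names, and the choice of
the complementary base for the mismatch are hypothetical;
uniqueness is checked only within the input sequence, not
against a genome; and secondary structure is not predicted.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_fraction(oligo):
    """Fraction of G and C bases in an oligo."""
    return sum(base in "GC" for base in oligo) / len(oligo)


def candidate_oligos(sequence, length=20, gc_min=0.4, gc_max=0.6):
    """Scan from the 5' end and keep windows that occur exactly once
    in `sequence` and whose GC fraction lies in [gc_min, gc_max]."""
    seq = sequence.upper()
    windows = [seq[i:i + length] for i in range(len(seq) - length + 1)]
    counts = Counter(windows)
    return [w for w in windows
            if counts[w] == 1 and gc_min <= gc_fraction(w) <= gc_max]


def mismatch_control(oligo):
    """Return the paired control: identical except for the central base.

    Assumption: the central base is swapped for its complement; the
    text only requires that one nucleotide differ.
    """
    mid = len(oligo) // 2
    return oligo[:mid] + COMPLEMENT[oligo[mid]] + oligo[mid + 1:]
```

Each perfect-match oligomer returned by `candidate_oligos` would
be spotted alongside its `mismatch_control` partner.
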

The invention also provides vectors containing the nucleic
acid molecules described herein. The term "vector" refers to
a vehicle, preferably a nucleic acid molecule, which can
transport the nucleic acid molecules. When the vector is a
nucleic acid molecule, the nucleic acid molecules are
covalently linked to the vector nucleic acid. With this aspect
of the invention, the vector includes a plasmid, single or
double stranded phage, a single or double stranded RNA or
DNA viral vector, or artificial chromosome, such as a BAC,
PAC, YAC, or MAC.

A vector can be maintained in the host cell as an extra- 
chromosomal element where it replicates and produces 
additional copies of the nucleic acid molecules. 
Alternatively, the vector may integrate into the host cell 
genome and produce additional copies of the nucleic acid 
molecules when the host cell replicates. 

The invention provides vectors for the maintenance 
(cloning vectors) or vectors for expression (expression 
vectors) of the nucleic acid molecules. The vectors can 
function in prokaryotic or eukaryotic cells or in both (shuttle 
vectors). 

Expression vectors contain cis-acting regulatory regions 
that are operably linked in the vector to the nucleic acid 
molecules such that transcription of the nucleic acid mol- 
ecules is allowed in a host cell. The nucleic acid molecules 
can be introduced into the host cell with a separate nucleic 
acid molecule capable of affecting transcription. Thus, the 
second nucleic acid molecule may provide a trans-acting 
factor interacting with the cis-regulatory control region to 
allow transcription of the nucleic acid molecules from the 
vector. Alternatively, a trans-acting factor may be supplied 
by the host cell. Finally, a trans-acting factor can be
produced from the vector itself. It is understood, however, that
in some embodiments, transcription and/or translation of the
nucleic acid molecules can occur in a cell-free system.

The regulatory sequences to which the nucleic acid
molecules described herein can be operably linked include
promoters for directing mRNA transcription. These include,
but are not limited to, the left promoter from bacteriophage
λ, the lac, TRP, and TAC promoters from E. coli, the early
and late promoters from SV40, the CMV immediate early
promoter, the adenovirus early and late promoters, and
retrovirus long-terminal repeats.

In addition to control regions that promote transcription, 
expression vectors may also include regions that modulate 
transcription, such as repressor binding sites and enhancers. 
Examples include the SV40 enhancer, the cytomegalovirus 
immediate early enhancer, polyoma enhancer, adenovirus 
enhancers, and retrovirus LTR enhancers. 

In addition to containing sites for transcription initiation
and control, expression vectors can also contain sequences
necessary for transcription termination and, in the
transcribed region, a ribosome binding site for translation. Other
regulatory control elements for expression include initiation
and termination codons as well as polyadenylation signals.
The person of ordinary skill in the art would be aware of the 
numerous regulatory sequences that are useful in expression 
vectors. Such regulatory sequences are described, for 
example, in Sambrook et al., Molecular Cloning: A Labo- 
ratory Manual. 2nd. ed., Cold Spring Harbor Laboratory 
Press, Cold Spring Harbor, N.Y., (1989). 

A variety of expression vectors can be used to express a 
nucleic acid molecule. Such vectors include chromosomal, 
episomal, and virus-derived vectors, for example vectors 
derived from bacterial plasmids, from bacteriophage, from 
yeast episomes, from yeast chromosomal elements, includ- 
ing yeast artificial chromosomes, from viruses such as 
baculoviruses, papovaviruses such as SV40, Vaccinia
viruses, adenoviruses, poxviruses, pseudorabies viruses, and
retroviruses. Vectors may also be derived from combinations
of these sources such as those derived from plasmid and
bacteriophage genetic elements, e.g., cosmids and
phagemids. Appropriate cloning and expression vectors for
prokaryotic and eukaryotic hosts are described in Sambrook
et al., Molecular Cloning: A Laboratory Manual. 2nd. ed.,
Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
N.Y., (1989).

The regulatory sequence may provide constitutive
expression in one or more host cells (i.e., tissue specific) or may
provide for inducible expression in one or more cell types
such as by temperature, nutrient additive, or exogenous
factor such as a hormone or other ligand. A variety of vectors
providing for constitutive and inducible expression in
prokaryotic and eukaryotic hosts are well known to those of
ordinary skill in the art.

The nucleic acid molecules can be inserted into the vector 
nucleic acid by well-known methodology. Generally, the 
DNA sequence that will ultimately be expressed is joined to 
an expression vector by cleaving the DNA sequence and the 
expression vector with one or more restriction enzymes and 
then ligating the fragments together. Procedures for restric- 
tion enzyme digestion and ligation are well known to those 
of ordinary skill in the art. 
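
The cleave-and-ligate step described above can be sketched in a
few lines. This is a simplified, single-strand illustration
only: the choice of EcoRI (recognition site GAATTC, cut after the
first G) is an assumption made for the example and is not named
in the text, the function names are hypothetical, and real
cloning must also account for the complementary strand, insert
orientation, and reading frame.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the number of site bases left on the upstream
    fragment (1 for G^AATTC). Only the top strand is modeled.
    """
    seq = sequence.upper()
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


def ligate_into(vector, insert, site="GAATTC", cut_offset=1):
    """Open a vector at its single `site` and join the insert between
    the two arms. Raises ValueError unless the vector has exactly one
    site (two fragments after digestion)."""
    left, right = digest(vector, site, cut_offset)
    return left + insert + right
```

For instance, the internal fragment released from a sequence that
carries two sites can be passed as `insert`; the recognition site
is then regenerated at both junctions of the ligated product.
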

The vector containing the appropriate nucleic acid
molecule can be introduced into an appropriate host cell for
propagation or expression using well-known techniques.
Bacterial cells include, but are not limited to, E. coli, 
Streptomyces, and Salmonella typhimurium. Eukaryotic 
cells include, but are not limited to, yeast, insect cells such 
as Drosophila, animal cells such as COS and CHO cells, and 
plant cells. 

As described herein, it may be desirable to express the
peptide as a fusion protein. Accordingly, the invention
provides fusion vectors that allow for the production of the
peptides. Fusion vectors can increase the expression of a
recombinant protein, increase the solubility of the
recombinant protein, and aid in the purification of the protein by
acting, for example, as a ligand for affinity purification. A
proteolytic cleavage site may be introduced at the junction
of the fusion moiety so that the desired peptide can
ultimately be separated from the fusion moiety. Proteolytic
enzymes include, but are not limited to, factor Xa, thrombin,
and enterokinase. Typical fusion expression vectors include
pGEX (Smith et al., Gene 67:31-40 (1988)), pMAL (New
England Biolabs, Beverly, Mass.) and pRIT5 (Pharmacia,
Piscataway, N.J.) which fuse glutathione S-transferase
(GST), maltose E binding protein, or protein A, respectively,
to the target recombinant protein. Examples of suitable
inducible non-fusion E. coli expression vectors include pTrc
(Amann et al., Gene 69:301-315 (1988)) and pET 11d
(Studier et al., Gene Expression Technology: Methods in
Enzymology 185:60-89 (1990)).
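
Separation of the desired peptide from the fusion moiety at an
engineered protease site can be illustrated with a short sketch.
The recognition motifs below are the commonly cited ones for
these proteases and are an assumption for the example; the text
names the enzymes but not their motifs. The function name and
the example sequences are hypothetical.

```python
# Assumed recognition motifs (one-letter amino acid code) and the
# number of motif residues that remain on the N-terminal product.
PROTEASE_SITES = {
    "factor Xa": ("IEGR", 4),      # cuts after ...IEGR
    "thrombin": ("LVPRGS", 4),     # cuts between LVPR and GS
    "enterokinase": ("DDDDK", 5),  # cuts after ...DDDDK
}


def cleave_fusion(fusion, protease):
    """Split a fusion protein at the first site for `protease`.

    Returns (fusion moiety with the upstream part of the site,
    released peptide). Raises ValueError if no site is present.
    """
    motif, cut = PROTEASE_SITES[protease]
    pos = fusion.find(motif)
    if pos == -1:
        raise ValueError(f"no {protease} site found")
    return fusion[:pos + cut], fusion[pos + cut:]
```
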

Recombinant protein expression can be maximized in 
host bacteria by providing a genetic background wherein the 
host cell has an impaired capacity to proteolytically cleave 
the recombinant protein. (Gottesman, S., Gene Expression 
Technology: Methods in Enzymology 185, Academic Press, 
San Diego, Calif. (1990) 119-128). Alternatively, the 
sequence of the nucleic acid molecule of interest can be
altered to provide preferential codon usage for a specific
host cell, for example E. coli (Wada et al., Nucleic Acids
Res. 20:2111-2118 (1992)).
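
Altering a coding sequence to use host-preferred codons without
changing the encoded protein can be sketched as below. The
standard genetic code is built from its usual compact form; the
small table of "preferred" codons is a hypothetical placeholder
for illustration and is not taken from Wada et al. or from any
measured codon-usage data.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical preferences for a few amino acids (illustration only).
HYPOTHETICAL_PREFERRED = {"L": "CTG", "R": "CGT", "K": "AAA",
                          "E": "GAA", "G": "GGC"}


def codons(cds):
    """Split a coding sequence into codons; length must be a multiple of 3."""
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in codons(cds))


def recode(cds, preferred=HYPOTHETICAL_PREFERRED):
    """Swap each codon for the preferred synonymous codon, if one is listed."""
    return "".join(preferred.get(CODON_TABLE[c], c) for c in codons(cds))
```

Because only synonymous codons are substituted, `translate` gives
the same protein before and after `recode`.
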

The nucleic acid molecules can also be expressed by
expression vectors that are operative in yeast. Examples of
vectors for expression in yeast, e.g., S. cerevisiae, include
pYepSec1 (Baldari, et al., EMBO J. 6:229-234 (1987)),
pMFa (Kurjan et al., Cell 30:933-943 (1982)), pJRY88
(Schultz et al., Gene 54:113-123 (1987)), and pYES2
(Invitrogen Corporation, San Diego, Calif.).

The nucleic acid molecules can also be expressed in insect
cells using, for example, baculovirus expression vectors.
Baculovirus vectors available for expression of proteins in
cultured insect cells (e.g., Sf9 cells) include the pAc series
(Smith et al., Mol. Cell Biol. 3:2156-2165 (1983)) and the
pVL series (Lucklow et al., Virology 170:31-39 (1989)).

In certain embodiments of the invention, the nucleic acid
molecules described herein are expressed in mammalian
cells using mammalian expression vectors. Examples of
mammalian expression vectors include pCDM8 (Seed, B.
Nature 329:840 (1987)) and pMT2PC (Kaufman et al.,
EMBO J. 6:187-195 (1987)).

The expression vectors listed herein are provided by way 
of example only of the well-known vectors available to 
those of ordinary skill in the art that would be useful to 
express the nucleic acid molecules. The person of ordinary 
skill in the art would be aware of other vectors suitable for 
maintenance, propagation, or expression of the nucleic acid
molecules described herein. These are found for example in
Sambrook, J., Fritsch, E. F., and Maniatis, T. Molecular
Cloning: A Laboratory Manual. 2nd ed., Cold Spring
Harbor Laboratory, Cold Spring Harbor Laboratory Press,
Cold Spring Harbor, N.Y., 1989.

The invention also encompasses vectors in which the 
nucleic acid sequences described herein are cloned into the 
vector in reverse orientation, but operably linked to a
regulatory sequence that permits transcription of antisense
RNA. Thus, an antisense transcript can be produced to all, 
or to a portion, of the nucleic acid molecule sequences 
described herein, including both coding and non-coding 
regions. Expression of this antisense RNA is subject to each 
of the parameters described above in relation to expression 
of the sense RNA (regulatory sequences, constitutive or 
inducible expression, tissue-specific expression). 
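
Why cloning in reverse orientation yields an antisense
transcript can be shown with a short sketch: the strand read
from the promoter is the reverse complement of the sense
sequence, so the resulting RNA is complementary to the sense
mRNA. The function names and example sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (5' to 3')."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def transcribe(coding_strand):
    """Return the RNA with the same sequence as the coding strand."""
    return coding_strand.upper().replace("T", "U")


def antisense_rna(sense_dna):
    """RNA produced when the sense sequence is cloned in reverse
    orientation downstream of a promoter."""
    return transcribe(reverse_complement(sense_dna))
```

For the sense sequence `ATGGCC` the sense mRNA is `AUGGCC`, and
`antisense_rna("ATGGCC")` gives `GGCCAU`, which base-pairs with
it in antiparallel orientation.
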

The invention also relates to recombinant host cells 
containing the vectors described herein. Host cells therefore 
include prokaryotic cells, lower eukaryotic cells such as 
yeast, other eukaryotic cells such as insect cells, and higher 
eukaryotic cells such as mammalian cells. 

The recombinant host cells are prepared by introducing 
the vector constructs described herein into the cells by 
techniques readily available to the person of ordinary skill in 
the art. These include, but are not limited to, calcium 
phosphate transfection, DEAE-dextran-mediated 
transfection, cationic lipid-mediated transfection, 
electroporation, transduction, infection, lipofection, and 
other techniques such as those found in Sambrook, et al.
(Molecular Cloning: A Laboratory Manual. 2nd ed., Cold
Spring Harbor Laboratory, Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, N.Y., 1989).

Host cells can contain more than one vector. Thus,
different nucleotide sequences can be introduced on different
vectors in the same cell. Similarly, the nucleic acid
molecules can be introduced either alone or with other nucleic
acid molecules that are not related to the nucleic acid
molecules such as those providing trans-acting factors for
expression vectors. When more than one vector is
introduced into a cell, the vectors can be introduced
independently, co-introduced or joined to the nucleic acid
molecule vector.

In the case of bacteriophage and viral vectors, these can 
be introduced into cells as packaged or encapsulated virus 
by standard procedures for infection and transduction. Viral 
vectors can be replication-competent or replication- 
defective. In the case in which viral replication is defective, 
replication will occur in host cells providing functions that 
complement the defects. 

Vectors generally include selectable markers that enable
the selection of the subpopulation of cells that contain the
recombinant vector constructs. The marker can be contained
in the same vector that contains the nucleic acid molecules
described herein or may be on a separate vector. Markers
include tetracycline or ampicillin-resistance genes for
prokaryotic host cells and dihydrofolate reductase or
neomycin resistance for eukaryotic host cells. However, any
marker that provides selection for a phenotypic trait will be
effective.

While the mature proteins can be produced in bacteria,
yeast, mammalian cells, and other cells under the control of
the appropriate regulatory sequences, cell-free transcription
and translation systems can also be used to produce these
proteins using RNA derived from the DNA constructs
described herein.

Where secretion of the peptide is desired, which is diffi- 
cult to achieve with multi-transmembrane domain contain- 
ing proteins such as kinases, appropriate secretion signals 
are incorporated into the vector. The signal sequence can be 
endogenous to the peptides or heterologous to these pep- 
tides. 

Where the peptide is not secreted into the medium, which 
is typically the case with kinases, the protein can be isolated 
from the host cell by standard disruption procedures, includ- 
ing freeze thaw, sonication, mechanical disruption, use of 
lysing agents and the like. The peptide can then be recovered 
and purified by well-known purification methods including 
ammonium sulfate precipitation, acid extraction, anion or 
cationic exchange chromatography, phosphocellulose 
chromatography, hydrophobic-interaction chromatography, 
affinity chromatography, hydroxylapatite chromatography,
lectin chromatography, or high performance liquid
chromatography.

It is also understood that depending upon the host cell in
recombinant production of the peptides described herein, the
peptides can have various glycosylation patterns, depending
upon the cell, or may be non-glycosylated as when produced
in bacteria. In addition, the peptides may include an initial
modified methionine in some cases as a result of a
host-mediated process.

Uses of Vectors and Host Cells 

The recombinant host cells expressing the peptides 
described herein have a variety of uses. First, the cells are 
useful for producing a kinase protein or peptide that can be 
further purified to produce desired amounts of kinase protein
or fragments. Thus, host cells containing expression vectors
are useful for peptide production. 

Host cells are also useful for conducting cell-based assays
involving the kinase protein or kinase protein fragments,
such as those described above as well as other formats
known in the art. Thus, a recombinant host cell expressing
a native kinase protein is useful for assaying compounds that
stimulate or inhibit kinase protein function.

Host cells are also useful for identifying kinase protein 
mutants in which these functions are affected. If the mutants 
naturally occur and give rise to a pathology, host cells 
containing the mutations are useful to assay compounds that 
have a desired effect on the mutant kinase protein (for 
example, stimulating or inhibiting function) which may not 
be indicated by their effect on the native kinase protein. 

Genetically engineered host cells can be further used to 
produce non-human transgenic animals. A transgenic animal 
is preferably a mammal, for example a rodent, such as a rat 
or mouse, in which one or more of the cells of the animal 
include a transgene. A transgene is exogenous DNA which 
is integrated into the genome of a cell from which a
transgenic animal develops and which remains in the
genome of the mature animal in one or more cell types or
tissues of the transgenic animal. These animals are useful for
studying the function of a kinase protein and identifying and
evaluating modulators of kinase protein activity. Other
examples of transgenic animals include non-human 
primates, sheep, dogs, cows, goats, chickens, and amphib- 
ians. 

A transgenic animal can be produced by introducing
nucleic acid into the male pronuclei of a fertilized oocyte,
e.g., by microinjection or retroviral infection, and allowing the
oocyte to develop in a pseudopregnant female foster animal.
Any of the kinase protein nucleotide sequences can be 
introduced as a transgene into the genome of a non-human 
animal, such as a mouse. 

Any of the regulatory or other sequences useful in expres- 
sion vectors can form part of the transgenic sequence. This 
includes intronic sequences and polyadenylation signals, if 
not already included. A tissue-specific regulatory
sequence(s) can be operably linked to the transgene to direct
expression of the kinase protein to particular cells.

Methods for generating transgenic animals via embryo 
manipulation and microinjection, particularly animals such 
as mice, have become conventional in the art and are 
described, for example, in U.S. Pat. Nos. 4,736,866 and
4,870,009, both by Leder et al., U.S. Pat. No. 4,873,191 by
Wagner et al., and in Hogan, B., Manipulating the Mouse
Embryo, (Cold Spring Harbor Laboratory Press, Cold Spring 
Harbor, N.Y., 1986). Similar methods are used for produc- 
tion of other transgenic animals. A transgenic founder ani- 
mal can be identified based upon the presence of the 
transgene in its genome and/or expression of transgenic 
mRNA in tissues or cells of the animals. A transgenic 
founder animal can then be used to breed additional animals 
carrying the transgene. Moreover, transgenic animals carry- 
ing a transgene can further be bred to other transgenic 
animals carrying other transgenes. A transgenic animal also 
includes animals in which the entire animal or tissues in the 
animal have been produced using the homologously recom- 
binant host cells described herein. 

In another embodiment, transgenic non-human animals 
can be produced which contain selected systems that allow 
for regulated expression of the transgene. One example of 
such a system is the cre/loxP recombinase system of bacte- 
riophage PI. For a description of the cre/loxP recombinase 
system, see, e.g., Lakso et al. PNAS 89:6232-6236 (1992). 
Another example of a recombinase system is the FLP 
recombinase system of 5. cerevisiae (O'Gorman et al. Sci- 
ence 251:1351-1355 (1991). If a cre/loxP recombinase 
system is used to regulate expression of the transgene, 
animals containing transgenes encoding both the Cre recom- 
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binase and a selected protein is required. Such animals can 
be provided through the construction of "double" transgenic 
animals, e.g., by mating two transgenic animals, one con- 
taining a transgene encoding a selected protein and the other 
containing a transgene encoding a recombinase. 
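
As a rough illustration of the mating scheme just described, the following Python sketch estimates the expected fraction of "double" transgenic offspring. It is not taken from the specification. It assumes that each parent is hemizygous for a single, unlinked transgene insertion and that transmission is Mendelian; the function name and the default probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import product


def offspring_fractions(p_cre=Fraction(1, 2), p_target=Fraction(1, 2)):
    """Expected offspring classes from crossing a Cre-transgenic parent
    with a parent carrying the selected-protein transgene.

    Assumption (not from the specification): each parent is hemizygous
    for one unlinked insertion, so each transgene is transmitted
    independently with probability 1/2.
    """
    classes = {}
    for has_cre, has_target in product((True, False), repeat=2):
        p = (p_cre if has_cre else 1 - p_cre) * (
            p_target if has_target else 1 - p_target
        )
        classes[(has_cre, has_target)] = p
    return classes


fractions = offspring_fractions()
# Offspring carrying both the recombinase and the selected-protein transgene
double_transgenic = fractions[(True, True)]
print(double_transgenic)
```

Under these assumptions about one quarter of the offspring would carry both transgenes. Linked insertions, multiple insertion sites, or homozygous parents would change the expected fractions.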

Clones of the non-human transgenic animals described 
herein can also be produced according to the methods 
described in Wilmut, I. et al. Nature 385:810-813 (1997) 
and PCT International Publication Nos. WO 97/07668 and 
WO 97/07669. In brief, a cell, e.g., a somatic cell, from the 
transgenic animal can be isolated and induced to exit the 
growth cycle and enter G0 phase. The quiescent cell can then 
be fused, e.g., through the use of electrical pulses, to an 
enucleated oocyte from an animal of the same species from 
which the quiescent cell is isolated. The reconstructed 
oocyte is then cultured such that it develops to morula or 
blastocyst and then transferred to a pseudopregnant female 
foster animal. The offspring born of this female foster 
animal will be a clone of the animal from which the cell, e.g., 
the somatic cell, is isolated. 

Transgenic animals containing recombinant cells that 
express the peptides described herein are useful to conduct 
the assays described herein in an in vivo context. 
Accordingly, the various physiological factors that are 
present in vivo and that could affect substrate binding, 
kinase protein activation, and signal transduction, may not 
be evident from in vitro cell-free or cell-based assays. 
Accordingly, it is useful to provide non-human transgenic 
animals to assay in vivo kinase protein function, including 
substrate interaction, the effect of specific mutant kinase 
proteins on kinase protein function and substrate interaction, 
and the effect of chimeric kinase proteins. It is also possible 
to assess the effect of null mutations, that is, mutations that 
substantially or completely eliminate one or more kinase 
protein functions. 

All publications and patents mentioned in the above 
specification are herein incorporated by reference. Various 
modifications and variations of the described method and 
system of the invention will be apparent to those skilled in 
the art without departing from the scope and spirit of the 
invention. Although the invention has been described in 
connection with specific preferred embodiments, it should 
be understood that the invention as claimed should not be 
unduly limited to such specific embodiments. Indeed, vari- 
ous modifications of the above-described modes for carrying 
out the invention which are obvious to those skilled in the 
field of molecular biology or related fields are intended to be 
within the scope of the following claims. 



SEQUENCE LISTING 

<160> NUMBER OF SEQ ID NOS: 4 

<210> SEQ ID NO 1 
<211> LENGTH: 2320 
<212> TYPE: DNA 
<213> ORGANISM: Human 

<400> SEQUENCE: 1 

cccagggcgc cgtaggcggt gcatcccgtt cgcgcctggg gctgtggtct tcccgcgcct 60 

gaggcggcgg cggcaggagc tgaggggagt tgtagggaac tgaggggagc tgctgtgtcc 120 

cccgcctcct cctccccatt tccgcgctcc cgggaccatg tccgcgctgg cgggtgaaga 180 

tgtctggagg tgtccaggct gtggggacca cattgctcca agccagatat ggtacaggac 240 
tgtcaacgaa acctggcacg gctcttgctt ccggtgaaag tgatgcgcag cctggaccac 300 
cccaatgtgc tcaagttcat tggtgtgctg tacaaggata agaagctgaa cctgctgaca 360 
gagtacattg aggggggcac actgaaggac tttctgcgca gtatggatcc gttcccctgg 420 
cagcagaagg tcaggtttgc caaaggaatc gcctccggaa tggacaagac tgtggtggtg 480 
gcagactttg ggctgtcacg gctcatagtg gaagagagga aaagggcccc catggagaag 540 
gccaccacca agaaacgcac cttgcgcaag aacgaccgca agaagcgcta cacggtggtg 600 
ggaaacccct actggatggc ccctgagatg ctgaacggaa agagctatga tgagacggtg 660 
gatatcttct cctttgggat cgttctctgt gagatcattg ggcaggtgta tgcagatcct 720 
gactgccttc cccgaacact ggactttggc ctcaacgtga agcttttctg ggagaagttt 780 
gttcccacag attgtccccc ggccttcttc ccgctggccg ccatctgctg cagactggag 840 
cctgagagca gaccagcatt ctcgaaattg gaggactcct ttgaggccct ctccctgtac 900 
ctgggggagc tgggcatccc gctgcctgca gagctggagg agttggacca cactgtgagc 960 
atgcagtacg gcctgacccg ggactcacct ccctagccct ggcccagccc cctgcagggg 1020 
ggtgttctac agccagcatt gcccctctgt gccccattcc tgctgtgagc agggccgtcc 1080 
gggcttcctg tggattggcg gaatgtttag aagcagaaca aaccattcct attacctccc 1140 
caggaggcaa gtgggcgcag caccagggaa atgtatctcc acaggttctg gggcctagtt 1200 
actgtctgta aatccaatac ttgcctgaaa gctgtgaaga agaaaaaaac ccctggcctt 1260 
tgggccagga ggaatctgtt actcgaatcc acccaggaac tccctggcag tggattgtgg 1320 
gaggctcttg cttacactaa tcagcgtgac ctggacctgc tgggcaggat cccagggtga 1380 
acctgcctgt gaactctgaa gtcactagtc cagtrtgggtg caggaggact tcacgtgtgt 1440 
ggacgaaaga aagactgatg gctcaaaggg tgtgaaaaag tcagtgatgc tccccctttc 1500 
tactccagat cctgtccttc ctggagcaag gttgagggag taggttttga agagtccctt 1560 
aatatgtggt ggaacaggcc aggagttaga gaaagggctg gcttctgttt acctgctcac 1620 
tggctctagc cagcccaggg accacatcaa tgtgagagga agcctccacc tcatgttttc 1680 
aaacttaata ctggagactg gctgagaact tacggacaac atcctttctg tctgaaacaa 1740 
acagtcacaa gcacaggaag aggctggggg actagaaaga ggccctgccc tctagaaagc 1800 
tcagatcttg gcttctgtta ctcatactcg ggtgggctcc ttagtcagat gcctaaaaca 1860 
ttttgcctaa agctcgatgg gttctggagg acagtgtggc ttgtcacagg cctagagtct 1920 
gagggagggg agtgggagtc tcagcaatct cttggtcttg gcttcatggc aaccactgct 1980 
cacccttcaa catgcctggt ttaggcagca gcttgggctg ggaagaggtg gtggcagagt 2040 
ctcaaagctg agatgctgag agagatagct ccctgagctg ggccatctga cttctacctc 2100 
ccatgtttgc tctcccaact cattagctcc tgggcagcat cctcctgagc cacatgtgca 2160 
ggtactggaa aacctccatc ttggctccca gagctctagg aactcttcat cacaactaga 2220 
tttgcctctt ctaagtgtct atgagcttgc accatattta ataaattggg aatgggtttg 2280 
gggtattaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa 2320 



<210> SEQ ID NO 2 
<211> LENGTH: 255 
<212> TYPE: PRT 
<213> ORGANISM: Human 

<400> SEQUENCE: 2 

Met Val Gln Asp Cys Gln Arg Asn Leu Ala Arg Leu Leu Leu Pro Val 
1 5 10 15 

Lys Val Met Arg Ser Leu Asp His Pro Asn Val Leu Lys Phe Ile Gly 
20 25 30 

Val Leu Tyr Lys Asp Lys Lys Leu Asn Leu Leu Thr Glu Tyr Ile Glu 
35 40 45 

Gly Gly Thr Leu Lys Asp Phe Leu Arg Ser Met Asp Pro Phe Pro Trp 
50 55 60 

Gln Gln Lys Val Arg Phe Ala Lys Gly Ile Ala Ser Gly Met Asp Lys 
65 70 75 80 

Thr Val Val Val Ala Asp Phe Gly Leu Ser Arg Leu Ile Val Glu Glu 
85 90 95 

Arg Lys Arg Ala Pro Met Glu Lys Ala Thr Thr Lys Lys Arg Thr Leu 
100 105 110 

Arg Lys Asn Asp Arg Lys Lys Arg Tyr Thr Val Val Gly Asn Pro Tyr 
115 120 125 

Trp Met Ala Pro Glu Met Leu Asn Gly Lys Ser Tyr Asp Glu Thr Val 
130 135 140 

Asp Ile Phe Ser Phe Gly Ile Val Leu Cys Glu Ile Ile Gly Gln Val 
145 150 155 160 

Tyr Ala Asp Pro Asp Cys Leu Pro Arg Thr Leu Asp Phe Gly Leu Asn 
165 170 175 

Val Lys Leu Phe Trp Glu Lys Phe Val Pro Thr Asp Cys Pro Pro Ala 
180 185 190 

Phe Phe Pro Leu Ala Ala Ile Cys Cys Arg Leu Glu Pro Glu Ser Arg 
195 200 205 

Pro Ala Phe Ser Lys Leu Glu Asp Ser Phe Glu Ala Leu Ser Leu Tyr 
210 215 220 

Leu Gly Glu Leu Gly Ile Pro Leu Pro Ala Glu Leu Glu Glu Leu Asp 
225 230 235 240 

His Thr Val Ser Met Gln Tyr Gly Leu Thr Arg Asp Ser Pro Pro 
245 250 255 
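
The relationship between the cDNA of SEQ ID NO 1 and the protein of SEQ ID NO 2 can be checked with a short translation routine. The Python sketch below is illustrative only and is not part of the specification. It assumes the standard genetic code and an open reading frame that begins at the "atg" located near nucleotide 229 of SEQ ID NO 1 (a position counted from the listing, not stated in the text); `translate` and `orf_start` are hypothetical names.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a DNA string in frame 1 into one-letter amino acid
    codes, stopping at the first stop codon."""
    dna = dna.lower().replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 48 nucleotides of the assumed open reading frame of SEQ ID NO 1
orf_start = "atggtacaggactgtcaacgaaacctggcacggctcttgcttccggtg"
print(translate(orf_start))  # one-letter codes for residues 1-16
```

The output corresponds to Met Val Gln Asp Cys Gln Arg Asn Leu Ala Arg Leu Leu Leu Pro Val, the first sixteen residues of SEQ ID NO 2. Running the same routine over the full reading frame is one way to catch transcription errors in either listing.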



<210> SEQ ID NO 3 
<211> LENGTH: 59065 
<212> TYPE: DNA 
<213> ORGANISM: Human 

<400> SEQUENCE: 3 

tcatccttgc gcaggggcca tgctaacctt ctgtgtctca gtccaatttt aatgtatgtg 60 

ctgctgaagc gagagtacca gaggtttttt tgatggcagt gacttgaact tatttaaaag 120 

ataaggagga gccagtgagg gagaggggtg ctgtaaagat aactaaaagt gcacttcttc 180 

taagaagtaa gatggaatgg gatccagaac aggggtgtca taccgagtag cccagccttt 240 

gttccgtgga cactggggag tctaacccag agctgagata gcttgcagtg tggatgagcc 300 

agctgagtac agcagatagg gaaaagaagc caaaaatctg aagtagggct ggggtgaagg 360 

acagggaagg gctagagaga catttggaaa gtgaaaccag gtggatatga gaggagagag 420 

tagagggtct tgatttcggg tctttcatgc ttaacccaaa gcaggtacta aagtatgtgt 480 

tgattgaatg tctttgggtt tctcaagact ggagaaagca gggcaagctc tggagggtat 540 

ggcaataaca agttatcttg aatatcctca tggtggaaag tcctgatcct gtttgaattt 600 

tggaaataga aatcattcag agccaagaga ttgaattgtt gagtaagtgg gtggtcaggt 660 

tacagactta attttgggtt aaaaagtaaa aacaagaaac aaggtgtggc tctaaaataa 720 
tgagatgtgc tgggggtggg gcatggcagc tcataaactg accctgaaag ctcttacatg 780 
taagagttcc aaaaatattt ccaaaacttg gaagattcat ttggatgttt gtgttcatta 840 
aaatctctca ctaattcatt gtcttgtcca ctgtccgtaa cccaacctgg gattggtttg 900 
agtgagtctc tcagactttc tgccttggag tttgtgagag agatggcata ctctgtgacc 960 
actgtcaccc taaaaccaaa aaggcccctc ttgacaagga gtctgaggat tttagaccca 1020 
ggaagaatga gtgatgggca tatatatatc ctattactga ggcatgagaa gagtggaatg 1080 
ggtgggttga ggtggtgttt taaggcctct tgccagcttg tttaactctt ctctggggaa 1140 
cgagggggac aactgtgtac attggctgct ccagaatgat gttgagcaat cttgaagtgc 1200 
caggagctgt gctttgtcta ttcatggccc ctgtgcctgt gaaacagggt tcggtgactg 1260 
tcactgtgcc tgtggcagtc tgtagttacc cagagagaac aaagctgcat acacagagcg 1320 
cacaagggag tcttgtaaca accttgtcct gctttctagg gctgagtcag gtaccacagc 1380 
ttgatctcag ctgtcctctt tatttcaaga agttgacatc tgagccatac caggagtatt 1440 
gtattttgtt tgaggcctct ctttttggag gaacatggac cgactctgtg cttttgtcta 1500 
tgctggtctc tgagctcaca caacccttca ccctcctttc tcagccagtg ataggtaagt 1560 
cttccctatc ttgcaaggct cagctcaagt gtcagcttcc tctacaaaga ctttcctggt 1620 
tcccctcatt ggagtgaaca agagttgaca tggtagaatg gaaagagcag aagctttaga 1680 
atgagccaga cctgagtatg aatgctagat ccaccactta gctagtcaac cctgccccct 1740 
gcctcaagtt ttaattttcc tatccattaa gtgaatataa taatacctgt gtcacaggat 1800 
tattttgaga attaaatgag attaggtcta tgaaagcacc tagcagagtt cttggcatat 1860 
aggaggcatt cattaaatat ttgttcttcc ccttttatac ccattacttt tctttttctg 1920 
aactcaaata atacttggtt ctatctctga aataacatcc aagtgaaaaa tcaacaacat 1980 
gaaagagcag ttcttttcca gtggatttgc ttcttaagga gcagagatta tgtaatctaa 2040 
cagcctccaa catacaaaga gctttgtatc tagaacaggg gtccccagcc cctggaccgc 2100 
caactggtac gggtctgtag cctgttagga accaggctgc acagcaggag gtgagcggcg 2160 
ggccagtgag cattgctgcc tgagctctgc ctcctgtcag atcagtggtg gcattagatt 2220 
ctcataggag tgtgaaccct attgtgaact gcacatgcaa gggatctggg ttgcatgctc 2280 
cttatgagaa tctcactaat ggctgatgat ctgagttgga acagtttgat accaaaacca 2340 
tccccccgcc ccccaacccc cagcctaggg tccgtggaaa aattggcccc tggtgccaaa 2400 
aaggttgagg actgctgatc tagaggacca atttattcaa tgttggttga gtaaatgagc 2460 
tcttggatta ggtgatggaa aaatctgaaa aaacagggct tttgaggaat aggaaaaggc 2520 
agtaacatgt ttaacccaga gagaagtttc tggctgttgg ctgggaatag tcataggaag 2580 
ggctgacact gaaaagaagg agattgtgtt cgtttcttct tctcagagct ataagcaaag 2640 
gctgaaagtt ctagaaaaag gcaagttttg tttcagtaga aaaaaggata atcagaacca 2700 
tttttagaaa atggaatgag actacttttg aggccatgag ttccttgtcc ctggagagat 2760 
gagcagaggt tggacaagtg cttaccagag atcttgtgga ggcagaaact gtgcatctag 2820 
cagagcattg gcctaaccct ttcaaatgag atgctgttaa ctcagtctta ttctacatgg 2880 
taggaatcct gtccctttgc ctcctgctac tttgggcctc tcaacctctt ggttttgtgt 2940 
gcaggtgaag atgtctggag gtgtccaggc tgtggggacc acattgctcc aagccagata 3000 
tggtacagga ctgtcaacga aacctggcac ggctcttgct tccggtaggt gggcctatcc 3060 
tcccatcttt accagtgtac tatgggccaa gcactatttc atgttctgat ggaaaacaca 3120 
gaaacaagct tctgagttga gaatttcaat cttagggtgg ggaaaggaat gtaccaagga 3180 
agagctcatg accaaacctc aagtgtggcc cccctgaacc caggttaaat tggaagagcc 3240 
ataaatgggc cagctggagg cagggtgggg ggatgagagg agccctttcc agggttgtcc 3300 
catatccctc actttatggg tgaggaaact gaggcccagg aagagtgact ttcctgtggc 3360 
tgcactacag attatgcagg tacttcaaga gttgtttgta ttcttatttt attttatttt 3420 
attttatttt attttatttt attttatgag agggattctt gctgttgccc aggctggagt 3480 
gcagtggtgc aatctcggct cactgcaatc tctgcctgct gggttcaagt gatttttctg 3540 
ccttagcttc ctgagtagct gagatgacag gcacctgcca ccatgcgcag ctaatttttg 3600 
tattttagtg gagacggggg tttcaacatg ttggtcaggc tggtcttgaa ctcctgacct 3660 
caaatgatgc acccacctcg acctcccaaa gtgctggaat tacaggcgtg aaccactgtg 3720 
cccagccaag agttgttttt agtgtggttg gcagagccag ctcttccttc accacaggat 3780 
gcctccctag gttcctactt tttgttacta gcttttatta tagctatatt attattatta 3840 
ttattattat tattattatt attattgaga cagagtctcg ctctgtcgcc caggctggtg 3900 
tacagtggtg cgatcccggg ctcactgcaa cctctgcctc ccgagttcaa gcagttctcc 3960 
tgcctcagcc ccccgagtag gtgggactac aggcgcctgc caccacaccc ggctaatttt 4020 
tgtattttta gtagagacgg ggtttcacct tgttgaccag gctggtctgg agctcctgac 4080 
ctcaggtaag tgctagaatc acaggcgtga accactgcgc ccagccaaga gttgttttta 4140 
gtgtggttgg cagagccagc tcttcctcac cacaggttgc ctccctaggt tcctactttt 4200 
tgttactagc tttattatag ctacattatt attattattg ttattattat tgagacagag 4260 
tctcgctctg tcgcccaggc tggtgtacag tgatgtgatc ttggctcact gcaacctctg 4320 
ccccccgagt tcaagcaatt ctcctgcttc agccccccta gtaggtggga ctccaggcac 4380 
ctgccaccac gcccagctaa tttttgtatt tttagtagag gcggggtttc accttgttgg 4440 
ccaggctggt ctcaaactcc tgacctcagg tgatccgcct gcctcggcct cccaaaatgt 4500 
tgggattaca ggcatgagcc accgcgccct gcctatagct acattatttt tgtaggcagc 4560 
tcagtttctt aaaaattata cagacttcaa atcagatttg ttcctgctgt ctgaggctca 4620 
gtttcttcat ctggaaaatg gatggtaata atcttgttga gattgaatga aataatatat 4680 
gcagtgtatc cagtacatgg tagacaccca gtgaatggtt attccttcct cccatcggat 4740 
tggaattctc aagggtggga acttgtcttt atattcttca caacgtaaaa tagttgaaat 4800 
ttgttggtgg aaagaagagc agtccactcc agaggctgga tgggcatgcc tggcccccaa 4860 
ggtctgaagt ggtagggctg tgcctatatc ctgagaatga gatagactag gcaggcacct 4920 
tgtgctgtag attccagctc ctgcacatag ctcttgttgt aaaacatccc tgtgcttata 4980 
ccaagtaatt gagttgacct ttaaacactt gcctcttccc tgggaaccat ataggggatt 5040 
ggcctggaga cgtctggcct ctggaagagt tggaaagcag ccatcattat tatcctttcc 5100 
tttcagctat aactcagagc tctcaagtct tttctgtgga tcttattgcc ttggttcttg 5160 
ccccttttac tcccagggaa gttgattctg tcttttctgt tccatttagt atgacaggag 5220 
cagagaatgt cagagctgta agggacctta tagttaaagc ctttggctgg tcctttcatt 5280 
ttatagctgg gactaataag taacgtcaaa acccaatgag ttcacagatt gggtctcgcc 5340 
ttggcatgta acccatatgt tcatattctt gctgttttcc tatgtgtatg aatattttct 5400 
atccaaaata agcaggacag ggtagagcaa gttaatcttt ggaatttctg gattctctta 5460 
gagctaaaaa acttcagaac tagaagaaac cacccactat atggtataac ccattcatat 5520 

cacagatgag gcctgaaacc aaaaagactt gctcaggcca tggatgacaa gagctggccc 5580 

tagcactgaa ctcttgggtc atttgtaggt ctagtcagat gctagcttgt tagctctgtg 5640 

cgtgcgtgtg tgtgtgtgtg tgtgtgtgtg tgtgtgagat agagacagaa agataacata 5700 

tgtacacaaa tacataaaga ggaagtagac acgttagcat ggtagataag agtacaggca 5760 

ggccaggcgt ggtggctcac gcctgtaatc ccagcacttt gggaggccaa ggcaggtgga 5820 

tcacctgagg tcaggaattc gagaccagcc tgaccaacat ggtgaaaccc catctctact 5880 

aaatacagaa aaaaattagc ttggcatggt ggcacatgcc tgtaatccca gctacttggg 5940 

eagctgaagc aggagaatcg cttgaatccg ggaagcagaa gttgcagtga gccgagattg 6000 

tgccattaca gtctagcctg ggcaacaaga gggaaactcc atcgcaaaaa aacaaccacc 6060 

accaagagta caggctatgg aatgagacta tggttttaaa tcctggcttt gcaatttatt 6120 

aactagcctt aagtgacttc cctgagcttc aggcaccaat ctgtaaaatg aggataagaa 6180 

tattactcat gccacatggt tgttagggag gattaaatgt gataacctat ataaagtggc 6240 

tagcatagca tctgacatat agaaaactct taatagggcc ggacgtggtg gcttatgcct 6300 


gtaatcctag cactctggga ggccgaggca gaaggatcgc ttgagcccat gagcccagga 6360 
gtttgagacc agcctggcca acatggcaaa actccacctc tacaaaaaat acaaaaatat 6420 
tagccaggcg tgatggcaca cacctgtagt cccagctact tgggaagctg aggagcgatg 6480 
attacctgag cccagggata tcaaggctgt agtgagctgt gatcatgcca ctgtactcca 6540 

tccagctggg ggacagagtg aaacccctgt ctcaaaacaa aacaaatgaa aaaaaaaacc 6600 
cttaataatc agtaactgtc actttatatt atgttgtgag tgtgtgtcta tatacaccta 6660 
tatgtataca tttctcttat tacacattca ttggtgatct gatgtggagc cccagggatt 6720 
aagggcaact ttgaactacc ctgacacaat caagccaaat atcattcccg tggaggaagt 6780 
agagtatcta ggttctgtct cctagttgca gctttacctt gaggacagag actctaatcc 6840 
agctgtgctg aaggagcaca tctcctgact tctgagcttt cccctggtaa attcaaactg 6900 
gatgtcacgg cgccctcaga tagagcctgg taatttgccc tggggagagt gactgtcttt 6960 
tggatctaat ttgacttttg ccccagttgg aggaaaatct tcagggctag gaaggattgt 7020 
atttgtctga ccccagagat aacctgggtt ttgaggaaca tggggcatca acctgaatgg 7080 

tcttgtaaga tctctcccac gccagcttgc cagtgtttct ctgatgaatt tagagtacct 7140 

gagtagtgca ggcctgctgg gaggaggact ctccctctgt gctactcaga gaaattcatt 7200 

cttcaaggcc cccttccagc cttgctctta cccagctggg ctacagttac aataaaggaa 7260 

atgacttttc ttctcccctt cccccagtac ctttgttttc ctagtcacag ggtggggctg 7320 

gatattgaat ggagaaattg ctggggtcca tcctaaactc ctcccctcat ctctccctta 7380 

cattacccca ttcttctgtc tgcagccaca tccataatcc tgcctctgtt agccttccga 7440 

cagaccctca ggtgcccagg acaacaggaa gctacttaaa gctggaacct cagactgtgc 7500 

catggaggcc agtgacaaaa ctgaaagtag ctctgtcagt aattgtgctg gtgcgattag 7560 

gcagctggcc agaatctttt ggatctcctg gacatatggc tgactagtcc tcccaagcct 7620 

tcccaacagg cctctttttt ttcctttttt tcttttcttt tttttctttc tttctttctt 7680 

tctttttttt ttttttttag gctagtgaag tgaaattgtg ggagtggaaa aggaacaaag 7740 

aaatcggtaa ctggtagtga tcaattactt gtaaacacta ttgtacttgg accagcccag 7800 

taggcctttt ttaaaactct gagttacctc tctttccttt ccttgagcag tgccattaat 7860 
tctgtatctg gggcaatcct ttctgatgtt ctctggacct ggctctctct ccttaggaga 7920 

ggccaggaga gtagccagag agcatgtcat ttgtagctga ggttaaagtg tggagctatc 7980 

aatggtgacc tggcctcttg gcatgttagc aagccagagg accttgacaa cttttttgat 8040 

gattgtccgt tcaccctgat caaaggtgtt tggcttagga ggagggaaga aaagctaccc 8100 

ctattagtct tgatggcccc agcgtgggtc tctattgctt gacctggttc ctagcagcat 8160 

tatcagaagg aaaatccacc gctcttaagg ctcctgggaa ctttcaggac ttcctttctc 8220 
aggattgcaa acataagact atttgagctt tcacttttga aaagcggtta ctaataccta 8280 



tactctggga aagggctaat gcagatagaa gactgtggtc actgcatcag gcaacagacc 8340 

atttccgcta aatttagtga ctccaggaag gccagtgaag aaataacaca cgtagcaacc 8400 

agagactgtg ttgtaatatg ttggctgaca gcagggtact ttctgtgatg ctgaaagcca 8460 

cattcatttt ctctcccctc atccccatct aagcaagcct ggtagaatca taattacagt 8520 

aataggtacc acttattgag tactctgtgc cagacaccct cctgagcata cgacatgcat 8580 

agcacattta atccttacaa tgacttaata aaatgtagta ctagtcttac ctacttcgag 8640 
aatagggaaa tggaggttac ttgtttaaag tcacagagct aataggtagc atagctgaga 8700 



tttgaactca ggcattctta ctccttgcct gcaagagtct cttggcattc ttgaatgcaa 8760 



gcatatttct taacctcact gaggctcagt ttcctcttat ataatatggg gtaaagagcc 8820 
ctcaccctgc ctgccacaca ctggtagtgt cagataacat tgaagggtgt tagtttaaag 8880 

gcttcatgga ctctataatg tcaacaaaag tgctgttaac tttcttctgg gtctcaggct 8940 

cctgatgtag agtcagtgga gcaaccctgc catdtgctgt tatgctgttg atgttgctgc 9000 
cacacttact aacctaaacc tttgattctg gctgtggcct tctccagaag gtgtttactc 9060 

atttgtccag tttatctttt aggaaacagc cagcccgtag atcattaagg ctggctattg 9120 

gacagggggc tggggcctgc ctgacagagg aaggaagggc agacatctgg ttcttcctct 9180 

gcccctacaa gagactccag cctgaccaca gagtggtact cctaggatgt agcagcagca 9240 

tatgagcttg aatgtgcctt aatcctgctc tttactttga gaagagagaa ctaaggaccc 9300 

acagatgttt cacagcttct ataggaggca gaggtagaaa aatggagaga gatgaggcca 9360 

gagatagata actgatatta attaaacgtt gtattaagaa cctcacttag attatctgat 9420 

tcaatcttca taataaccct gcaaccccca cctttttttg agaacagggt cttgctctgt 9480 

tgtccaggct acagtgcact ggtacaatca tagttcactg cagtgtcaac ctcctgagct 9540 

caagcaatcc tcccacctca gccttgcaag cagcttggac tacaggcgtg ccaccacacc 9600 

ttgccatttt tttttatttt aagtagaaac aaggtcttat taatactatg ttgcccaggc 9660 

tggtcttgaa ctccagcgat cctcctgccc cagcctccca aagtgcttgg gattacggaa 9720 

gtaagccact gtgcctggcc agtgcaaccc ccattttata ctaaaacagg aaggcccaga 9780 

aaggtttgga gtaacttgtc cagggtcaca cagatgatat ttgaactcag gtctccctgg 9840 

ctcccaagag agtctgcttt ccactaggac tcccaggaga aaaaaaaaaa aaaaaacagt 9900 

agacttggag acagaaaatc tgatttgagt cttagttgag ctaggctaac tgtgtaactg 9960 

tgggcaagtt ccttagcccc tgtgagcctc agtttcttat ctgtaaaatg tcataaaaga 10020 

aatccatctc atggagtagt tgtgatgatc aaggactctg aaaacattag aatggtttaa 10080 

tgtgaaggat tagcagcagc acatggcaac attgtgcatc ttatattaac tatccaaata 10140 

tatcaagcgt catttgctat atataaaagt catcaaatta ggcactgtgg gggatacgga 10200 
gttggcatac tagcctggcc tcttaattaa ttcattaatt agcttattta tttttgagat 10260 

aggtcttgct ctattgccca ggctggagtg cagtggcatg atgatagctt actatagcct 10320 

caatctccca ggcttaaaca atcctcctga gtagctggga ctacaggcac acactaccat 10380 

gcccagctaa ttttttttta attttttgta gagacagggt cttgctctgt tgcccaggct 10440 

ggtctcaaac tcctgggctc gagatcctcc cacctgggcc tcacaaagtg ttgggattac 10500 

aggtatgagc cacggcacct ggcctggtct cttaactggt tccctaagac agctggaaat 10560 

agagaatgtc atggagcatt cctaaccatg ggctccagcc tggctttcat tctgtttctc 10620 

ccctgaaaca acattccttt agtaatattc cgaataacag cttcatcagt ctgtctaccg 10680 

accactcttc aggcttcatc ttatatgacc tcccaaactg cactaagggt tgtattagag 10740 

aaaagtggat aaagttcgga gtcaggctgc ttgagcttaa atgccagctt cacttaccag 10800 

ccacctgacc atgagtcagc tgcttaacca ttctttgcca cagtttcctt gtctatgaaa 10860 

agggaaatgg ctcccacctc aaaaagttgt taacattaaa ttcaatcatg tattcaaagt 10920 

cctgagcaga atgtctggcc atgactggga cttaacagat gttagcattt attattagta 10980 

tctgtcagtc ttgaaatgtt ctcttccctt ggctttcatg acattccaca ctctcctggt 11040 

tttctcttac ctctctggta atacctgttt gcttatcctt ctttgtccag ctctgggatg 11100 
ttaccattcc ttcaggcgtg ctgttttctc cttaggcagt cttacacaca ctcatgactt 11160 
ccttccattg tcctccacac actgatgacc ctaaaatcag tatctccagc ctaaaccttt 11220 
ccactgagtt ctagacccat atgttgtact atcaacctgg cttgtccatt tgaatgtctt 11280 
ccaggcactt cagactctct tctctagact ttgctggact ttcactcttc cccctaaaac 11340 
tggctcctct tccactgaaa catgtatgtc attgagaggc accaccatcc acccagtgcc 11400 
taagccagaa acctaggaat ccttgatacc tgttctctct catcctgcat atccaagcct 11460 
atcagtttta tctctaaatt atattttggt aggtttactt ctttcctttt ctcccaccac 11520 
caccctgctc caagctacca tcatctcacc tggatgtctg caatagcctc atctcccaca 11580 
gccactctgc accccctaat ctgttctcta tagagcagtt ggaaggagtg atttttgttg 11640 
tttgttttgt tttgttttag acagagtctc actctgttcc ccaaggctgg agtgcagtgg 11700 
cacaatttcg gctcactgca acttctgcct cccgggttta agcaattctc ctgcctcagc 11760 
ctcccaagta gctgggatta aggcaccggc ccccataccc agctaatttt tatattttta 11820 
gtagagatgg ggttttgcca tgttggccaa gctagtctcg aactcctgac ctcaagtgat 11880 
ccacctgcct cggcctccca aagtgctggg attacaggtg tgagccactg cacctggctg 11940 
gaaggagtga tcttaaaaaa aaaaaaaaca aaaaaaaact tgactgtgtc actctgtgtt 12000 
gtctctccta ccttgtatac ttccacaact tcccagtgtt cttggataaa gaccaaaatc 12060 
cttaacttgg ccaggcgcgg tggctcacac ctatcatctc agcactttgg gaggccgagg 12120 
caggcagatc atgaagtcaa gagattgaga ccatcctggc caacatggtg aaaccccatc 12180 
tctactaaaa atacaaaaat tagctggtcg tggtggcgtg tgcctgtagt cccagctact 12240 
tgggeggctg aggcaggaga atcacttgaa cctgggaggc agaggttgca gtgagcccag 12300 
atcacgccac tgcactccag cctggtgaca gagtaagact ccatctcaaa aaaaaaaaaa 12360 
aaaaaaaaaa ttccttaatt tggcctacag tagagccctc cgtaatgtgg cctctctcca 12420 
catctccaca acctcctgct ccctgcactt cagcctcacc tctcttctgg acaggccctc 12480 
cttctgacaa gggctttgtt cattctgctc cctctgccta gaatgccccc ttactctgtt 12540 
cacttaactc ctgcttatcg tttagatctt tacctggatg gctcagagaa atatagaagt 12600 
aattcctcac cctgaaaaat aggttaggtc cctgttttat gttttcatag acctttcctt 12660 
tgaggctttt tttaaaaaag tagttttaat ctcacattta ttcatgtgat catctcctta 12720 
atgatatctt aagacctcta atagaacaat ttggtcatgg actgtggggt ttttgcccct 12780 
cattgtgtca gcactgagca tattgttggc ataggaggga tatttgttga atgaattgct 12840 
agaggtggcc aagagatatg atgtaagtca ggcttttccc tgcccttccc cttccccttc 12900 
cccacatcct tcctatagca gccaccgtgg ctgcagttac tgtaaatggc aagacggaat 12960 
cagttccgga cattgggttg ttttagaaaa ttgcctgcaa gtgtcagggt gataagttaa 13020 
agctttgtct tttgccctca gaggagctat cccatagtga gtagaagcca gagaagctga 13080 
ccccaggagt ccttctttcc agcagcaggt cttgagctgc acttctctgt agctacaatc 13140 
caggcaggaa caagccctag gtacctccgg agaggagggc aagagaggaa gaatgagttc 13200 
agctactcta gccaccaaac tgattatgaa ttgccctgaa atctgaaaaa tttcaattcc 13260 
aatcgtaagt ttgttttgtt tcattttgtt ttcttaaatt gtatatttga aagatggcat 13320 
taactaaaga tatatattca atatagagtg gaaaaaatgg aatacttgca tagtatcttt 13380 
tacttatagg tgatttatga tggggagtgg ggtggatagg ttggcagttc ccccaagaag 13440 
ttggaaatga agtttgtcct ctgtgagttg aactaattag atccacaagt aatgaaagca 13500 
gtattgtgtt gtagttaaga gcacactcta gaaccagatt gcttagtttc aaatcctggt 13560 
tctgcctttt attatctgtg tactttgggc aagttacttg ccctttgtgt gcttcatttt 13620 
tctcatctag aaaatggaga ggccaggcgt agtggctcat gcctataatc ccagcacttt 13680 
gggaggccga ggcgggcaga tcacctgagg tgagaagttc aagaccagcc tggccaacat 13740 
ggtgaaaccc tgtctctaca aaaatacaaa aattagccag gcatgatggc gggtgcctgt 13800 
aatcccagct acccaggagc ctgaggcggg agaaacactt gaacctggaa ggcagaggtt 13860 
gtagtgagcc aggattgcac cactgcactc cagcctgggt gacaagagct agactcagtc 13920 
taaaaaaaaa aaaaaaaaac aaactggaga tacaggctgg gtgcagggct tacacttata 13980 
atatcagcac tttgggaggc ctaggcggga ggattgcttg aactcaggag tttcaagatc 14040 
agtctgggta acagagcaag acctcatccc cacaaaaaat caaaaattta gccaggcatg 14100 
gtggctcatg cctgtggtcc cagctactca ggaggctgag gcgagaggat tgcttgagcc 14160 
caggaggttg aggctgcagt gaaccatgac tgcaccacta catgccagcc tggatgacag 14220 
agcaagaccc tatctcaaaa aaaaaaaaaa aaagaaacga gccaggcgcg tttgctcacg 14280 
ccagtaatcc cagcactttg ggaggccaag gcaggtggat cacttgaggt caggagatcg 14340 
agactagcct ggccaacatg gtgaaacccc atctcaactg aaaatacaaa aattagccag 14400 
gcatggtggc atgctcctgt agtcccagct actcacttgg aggctgaggc acgagaatcg 14460 
cttgaaccca ggaggcggag gttgcagtgg gccaacatca tgtcactgca ctccagcctg 14520 
ggagacagag cgagactctg tctcaataaa taaataaaca taaaataaaa taaaataaaa 14580 
taaaataaaa taaaaaaata tggaggccag caggcacggt ggctcacgca tgtaatccca 14640 
gcactttggg aggccgaggg gggcggatca caaggtcagg agatcgagac catcctggct 14700 
aacacagtga aaccgcgtct ctactaaaaa tacacaaaat tagccaggca tggtggcagg 14760 
cacctgtagt ccctgctact caggaggctg aggcaggaga atggcgtgaa cccgggaggc 14820 
ggagcttgca gtgagctgag atcgcgccac tgcagtccag cctgggcgac agagcaagac 14880 
tctgtctcaa aaaaaaaaaa aaaaatggag gttgggcgcg gtggctcgcg cctgtaatcc 14940 
cagcactttg ggaggtcgag gcgggcggat cacctgaggt caggagttcc agaccagcct 15000 

ggccaacatg gtgaaacctt gtctctacta aaattacaaa aattagccag gcacgatggc 15060 

aggcacctgt aatcccagct acttaggaga ctaaggcagg agaatagctt gaacctggga 15120 

gatggaggtt gcagtgtgct gagatcgcgc cactgccctc cagtagagtg agattccgtc 15180 

tcaaaaaaaa aaaaaaagaa gaaatggaga tacaaactta ctacctacct ccttacaacc 15240 

taccctcaca gtattactgt gaataaaagt gtgtgtagca ctgggaacac tattcacaga 15300 

gcactcatga atgtttgttc tttgttatta gttactagag aggcaaatgt ctgccagggc 15360 

tgaataatat gtgtgaattg gtgattgtcg cacatatcta aagaagtagt tatttttttc 15420 

aattaaaact tagtttaaaa accaatataa ggccgagcgc agtggctcac acctgtaatc 15480 

ccagcacttt gggaggccga ggtgggcaga tcatttgagg tcaggagttc gagactagcc 15540 

tggccaacat ggtgaaaccc tgtctctgct aaaaaaaaaa aaaaagtaca aaaattagcc 15600 

aggcatgatg gcaggtccct gtaatcccag ctacttggga ggccgaggca ggagaattgc 15660 

ttgaacccag gaggtggagg ttgtagtgag ccgagtttgt gccactgcac ttcagcctgg 15720 

gtgacagagg gagacactgt ctcaaaaaaa aaaaaaaaaa accaaaacca atataataaa 15780 

taagtggcca gcaatgaaac agaaagtgaa cagttagtga agcaaaacta gtactgtatt 15840 

cagataaaga tgctgaatct agatttggtc accagaatag ggtcctttgt ggcaacctgg 15900 

gctagtttgg ctgactcacc actgccagga tgaaatttct ttcagtggct actcatttcc 15960 

ctttatttta agtccatgct cacagagcaa ccttctgatg cctaattcag cttcctggga 16020 

tacttaataa caggaagggt ctggaagtag tacctgtata ggggatatga gtgttctgat 16080 


tttaatagtc aattcataag tgtacagagg gtttgataaa tggttaggtc agaaccatca 16140 

cagaatgtct acacctcttt ggacattagg aaggtcaaaa acctgaaagg ccaaaagcta 16200 

ggcctagatt agggtcattc accaagaaaa catcagcctt gaagagttct ctgggtggtc 16260 

caccagtcaa ccttcctttg atcacacctc cttcctcgtt gcttctttaa gcattgacct 16320 

gtaatgggta tggaattttt tgctcaccta actccttcct tttacagagg aagaagttga 16380 

agcccagaga gatttaatgg cttgcctaag atcacacgca gattttctgt taaccagggt 16440 

gatttttcag gtgttccctg ccagacgagg gcttttttcc ttgaattgcc tagagatttc 16500 

ttgagatatc cgaagcattt ttcccagtgc agcctggaga aggatgtccc tgtcaacaca 16560 

gcatttgtta ctcaatgtta gacattcaat tttctaatta gtatcatgga gcaacagtgg 16620 

atgattatct ataaggggtt gcaattccat gcttatgtgc ttacagccca tatagacaaa 16680 

tatcagctgt taaaatgaca aggcagtaga gatgtggccc caggacaaag gcatactctg 16740 

ctgttagtga acactagttg gccagcaaat ttcacatggg catatacacg gccaactgta 16800 

gactttaggc atttataccc attcagagag ccaaactggc aactaaagat cagcattctc 16860 

tttggcattt cagctttgcg ttctgttaaa aatcactgct tgcttaaata cctctgatag 16920 

ctcttcactg cctgtaggca actctttagc ctagcagact tggtctttag tgctctgccc 16980 

ctactctctt ccaccattct ggcctcctgt ctaattgctg cccatatgtg ccatgcacta 17040 

gagcttacag acctgctcag cgttatatga gcataccata ctctttatgc ctcagtgcat 17100 

ttgcacatgt tgttccttca ggccagaatg cctgttactg cctggcaatc agcctattag 17160 

agtctgccaa taccatccca tcttctgtgg aggagccccc cgccaaatcc acccatacct 17220 

ctccccacca atcagagact tcttctctct ttgttattct cttcgttatt ctcttcatac 17280 

ctcagttata tccatttcag tatttgttta cacatctagc atcactctta gagtgtgaaa 17340 
ttctccaagt gtggagccgt atctagtttg tctttgtatc ccagagctta gcaaagtgcc 17400 
tagaatgtag tgggtgctca gagtgtttgc tgggtgaatg atgtatttgt tgaacgactc 17460 
tttggacact tgaataaagt ccatccagta tgcaccatta ccatctcttc gctctacaat 17520 
attcttttag gcaagagctt atcttttgag gtgataagat aagctcaaac ttatgtagac 17580 
taagacctca gtctgtaaat gtcatcccta agtcttaaac catcaaaacc agggcctcaa 17640 
ggaatggcat gccttctgca actgtagcaa cctgctgtgc ttattttgcc gtgtttttca 17700 
tttttccccc aaaagctaga gtcccttctc ccatgggcag tgctggaagt gtgctaacaa 17760 
attctttctc catactgctt acgattacaa aaaaaaccct cagcatctca tgccagactt 17820 
gagttaaggt tgttttcttt tgtgtgtcag ctgtattctg gtcatgactt cctgatgatg 17880 
ccctatagag attttgctga gatcagaggg tgctccactg ccatcagtag cactgactct 17940 
tgcagaagca ccgtttctga agttggctaa tgtcatccct cacgtttgtt tgtttgaaat 18000 
ttgttttagt tccagagata gcactttcat ggaatgacgc tatcttctag aatcactttt 18060 
tttttttttt tgagttggag tctcgctgtg tcgccaggct ggagtgcagt ggcacaatct 18120 
cagctcactg caatctccac cttccgggtt caagtgattc ccctgcctca gcctcccgag 18180 
gagctgttac tacaggcgca cacccccact cctggctaat tttatgtgtt ttagtagaga 18240 
cggggtttca ccgtgttggc caggatggtc tcgatctcct gactttgtga tctgcctgct 18300 
tcagcctccc aaagtgtrtgg gattacaggt gtgagtcacc gcgcctggcc tagaatcacc 18360 
tttttatacc ataacgtgag caccactgcc gcgtcaccaa ggaaagagag aggcagctac 18420 
tgtggggtta caaatgggta agagtggcac caggaaggtg aaagtctcta cttagccaag 18480 
gcttaacaaa atgtcaatca ccaaacattt atttattaag ctacgttcag gataagaaga 18540 
tgaacaagct atctgtacat tcattttctc gtttgtaaca aggtaatgat agtgatctat 18600 
cctgcctgcc tctgagggtt attgtgagaa taaaatgaaa tcaagtggaa aagcacttag 18660 
gaaaaagaaa agcattggtt ttcaattgtt agtgtggatc cgaaacactg gggcttgttt 18720 
aaaatgcaga ttcttagccc cagtctcagc gattctgatt ctgtatatct gaagtgggac 18780 
tcaggaatct tgattttcaa caagctgacc agagggtcca atgctgctat tcctttagtt 18840 
acactttcag aaatattact gtaaatcaaa tggcaagaat aaaatagtta tttgaggcag 18900 
ttttagtatg ttggacctgg agtccaaaga cttgggtcaa actccagctt tgtcagttcc 18960 
tagacctgtg accttaaaca gcaaccttct ctgtgaacct tagttccctc aggaacggct 19020 
ctggtcacct cctgctgtac tccattgatg actcaccaca taaggctccc tgggagtccc 19080 
ccaaaccttt gctctcttaa ctccttttac agcctcctac atctcctgca ggtgctgtct 19140 
tctcctcctt tttccaggcc ctgctctgac acagcattca ttctcctctg ggaagggttc 19200 
cttcaatgtg tctccaagca catcacaccc aggaaggacc ctgtggccat atctgtctat 19260 
caccagatca aactacgtga aggcaggcac taggtactgt cagtgcccag cataggcctg 19320 
gcccatacca ggtgtccaca gatgcctagt aaagaaacct atgattcagg acccccatga 19380 
tgagcaacta tagcactaga acagtgataa taactaatgt ttataatgca tcttcagttt 19440 
acagagggct tttgtactca tcatctagtt tagttcctgc aacaacctct tgaggaatat 19500 
agcacaagca ggacaaggga agcccagaga tgttaaataa tttatccaag tttatgctgc 19560 
tgggaagggc agcactgaaa ttaaaagaaa agttttctga gctcaaatcc catgcccttt 19620 
cctcaatgtg agctctagca aggtattcag gaatcctgcc tctacagttc agagcctcaa 19680 
attgctgggt atgttgagtt cttgtatctg atttttctag atttcctgcc cacattctta 19740 
ctgtctggat atcaggaaag agtttatcaa ctgcctgtgg aaatccaaga taaggtctca 19800 
tgatgagtaa cccagtgaaa acatgaagtc aagtctaact agtcactact atttcactac 19860 
tgctgactcc tgatgatcag ctccttttct aagtgcttac tgtccactta ttccatcatc 19920 
tgcctagaat ttatgtgaag gaatcaaagc aaaaggatca taaggcttcc tttttccagt 19980 
atgtttttcc tcctttttga aaactgggcc agttagctat ctccattttt atttcatgaa 20040 
tacatcccca gcgcctggta tatagtagat atggaacatt acactttgga gatattgcac 20100 
ccattctcca gtttctccaa agttactaac aatggttcca tcactgtgcc aacatatttt 20160 
cttttttcaa tatattggga aataattctc ccagtctgaa aatctgaaca catttcatgt 20220 
gacttggtat cctcatatgt cttgggcttc caattctcca ttcctagttt caagttcatg 20280 
aactgtaaaa caaaggatta gactaaatct ctaaagttct atccagatgc caaattcttt 20340 
tctctttcca tgatacctaa gatagatgcc aaatattgtc ttttacctgg tgtttgtgaa 20400 
catgacatca cattacagga gtagcagata ctaaactctc actctgtaaa acactgactg 20460 
agttccatga gccagatact gaagtgagct tgttcacata tgttctcatt taatgctcat 20520 
aaccctgtga agctgggaat tgctgggaca ttttatttat ttatttattg agacggagtc 20580
tggctctgtc acctaggctg gtgtgcaatg gcatgatctt ggctcaccgc aacctccgcc 20640 
tcccgggttc aagcgattct cttgcctcag cctccgcagt agctgggatt acggggcaca 20700 
caccaccaca tccagctaat tttgtatttt tagcagagat ggagtttctc catgttggcc 20760 
aggttggtca cgaacacttg acctcaagtg atctgcctgc ctcagcctcc caaagtgctg 20820 
ggattacagg catgagccac catgcctgcc cgggaccctt gttttagaag gatgactgct 20880 
gctataatgt agaaagtgat ttggaagagg ggaggagtgg ggcacgaaag atggttagta 20940 
gatgggggtg gtaatgctta cctttcagta tttggaggct tcggagtcct caaaaattct 21000 
cttccttgat tggagtcctc ccagccaata gagggcttca cacaaacagt ttcttgggtt 21060 
ttgaattgtt tgaccagagc tttcttccga caaaaggttg gggtgattca ttcacttacc 21120 
acaccttgcc tgaacattca cttggggctg ccggttatga aggctattgt tctccagcct 21180 
gtcacagacg ctttgaagac ctgtgcctca gctggttcta aggagtcagt ttgttcagct 21240 
ccgtgccagg tttccaactt atgaaatgtg ctggagatta acacctctcc tgccatttta 21300 
tccctactat aattgccagt caaaggattc ctgcagttgc ctctggcagc cataactgat 21360
gaatgttctg ccagctgctc tgaggaccta gaagagcagt tttctatcca ggaccagttt 21420 
ccaagggtgg gagggtgaaa tatatcctcc agtgtgacat ttcatctccc agtgatgggt 21480 
ggcttgggcc ctttgaagtt ggctctgagg aaccacacac ttgggtctga gcagccagca 21540 
gcttatcaca tctggtgatc aatccttcaa aggttcctcc tgaagtctga atttttggag 21600 
gtcaaatgga ttccacctgg gaggggcttc tgcttcaact caggacatgg ggagaaggct 21660 
gttcctcttc cagggggagg cagttttcat ggcattgaga tgtcctctca cttattcccc 21720 
acccacccac caagtccttt gtaagaggag tagggggaga ggagagcgcc tgcagcctcc 21780 
tgctcacatt cctagacacc gactcactga gcccgtcgcc gctggaacag cagagctgtg 21840 
tgaaatgtca agaggagtta tgctcatagg ctccctggcc tcagtctctt tgtggcttgc 21900 
atattcttcc attagtactg tgttcatcac atggaaatca gagggtacaa ttaaaagata 21960 
atttgctagt cccagactta atttggggcc cccttcttgc ctgattgaat tacaggggaa 22020 
cataatagat ttttggtgag aaatagttgt ctgtgtggct gggagaaaga ttgctcccag 22080
ctctccagct gggcagccct ttcagtatcc cgtatgttat ttccccactt ccagcccacc 22140
tcacctcctc tgtggccctt gtgtgtcccc tcggctagga tcctgacctc ctgctcaaga 22200 
gtttaaactc aacttgagac ccaaggaaaa tagagagccc tctgcaacct cataggggtg 22260
aaaaatgttg atgctgggag ctatttagag acctaaccaa ggcccagaca gagagagtga 22320 
cttgctaaag gccacatagc tagcccacag tcgttgtaac aatagtctta atgatattaa 22380
tggctaacat ttatcaacct ttaatgtgtc ccagactttg tgccaagggc ttacatgcag 22440 
tgcattgtcg cattcaaacc cagacagtct ggctctgggc ccaggctgag ctttggtata 22500 
gcatggtaga acgttgtcta taatgtctag tctgggttca aatcctggct tcacttctca 22560 
catttacagc tgagtgacct caggcaagtg atttaacctc cctgtacctc agttgcttta 22620 
tctgtaaaga gaaaaatcac agcactgtgg aatagtgggg gttaaaattc attcatacaa 22680 
gtagtgctgc aagcaatgtt taatacaggg tgagcacctg ttcagtgctt ccttcttctg 22740 
gctgcctctg gggctagagt gtggtgtctt cgtggtatag atagatagat atggctgagc 22800 
tctgcacaaa caccaagagc tgttcttcac tattagaggt agtaaacaga gtggttgagc 22860 
tctgtggttc tagaacagag gccggcaagc tatggcccat tgcctatttt aatacggcct 22920 
gtgattgatt gatttttttt ttctttttga gacagagttt cactcttgtt gcccaggctg 22980 
gaatgcaatg gcacgaactc agctcaccgc aacctctgcc tcctgggttc aagcgattct 23040 
cctgtctcag cctctcgagt agctgggatt acaggcatgt gccaccacgc ctggctaatt 23100 
tttgtatttt tagtagagac agggtttctc catgttggtc aggctagtct cgaacttcca 23160 
acctcaggtg atctgcccgc ctcagccttc caaagtgctg ggattacagg cgtgagccac 23220
catgactggc ctgattgact gattttttta gtagagatag ggtcttggtt tgttacccag 23280 
gctggtctca aacttctggc ttcaagcagt cctccctcct tggcctctcg aatgctggga 23340 
ttataggcat gagccactat gcctggccta tatgacctgt gatttttaat ggttagggga 23400 
aaaaaagcaa aagaatgctt tgtgacatgt ggaaattaca tgaaactcaa atatcagtgt 23460 
cccagcctgg gcaacaaagt gagaccctgt ctctacaaaa aataaaaaaa aataagccag 23520 
ggccgggcgc agtggctcac acctataatc tcagcacttt gggaggccga ggcaagtgga 23580 
tcacctgagg tcaggagttc aagaccagcc tgaccaatat ggtgaaaccc tgtctgtact 23640 
aaaaacacaa aaattagccg agcatggtgg catgcgcctg tagtcccagc tacttgggag 23700
gctgagacaa gagaattgct tgaacctggg aggcggaggt tgcagtgagc caagatcgcg 23760 
acactacact gcagcctggg caacagagcg agactccgac acacgcacgc acgcacacac 23820 
acacacacac acacacacac acgctgggta tggtggccag cacgtgtggt cccaggatgc 23880 
actggaggct taggtaggag gatcacttga gcttaggtgg ttgagactac aatgaaccat 23940 
gtttatacca ctgcacttta gccagggcaa cagtgtgaga ctgaatctca aaagaaaaaa 24000 
aaaaaaaaga aaaaaatctt tccataagta aatatctgtt ggaacatagc catgtccctt 24060 
agtttatgtt ttatatatgg ctgcttttgc cctataatga cacaattgag tggccacgac 24120 
agtctgtatg gcctgcagag cctaagatat ttgctctctg gccctttaca gaaaaagtgc 24180 
cttgacctgt gctctagagc catatgtacc aggtttgaaa ctcagcctca cagctgggtg 24240 
tgatggcacg catctgtagt cccagctact ctggaggctg aggtgagagg atcacttgag 24300 
tccagaaggt cgaggtcaag attgtagtga gccatgatgg catcaccgca ctccagcctg 24360 
agtgacagag agagaccctg actcaaaaaa aaaaaaacaa aaaaaaaaaa caccctcacc 24420
acttatcagc tatttgtctt gagaatagtg acataacccc tcagaaccta tttcctaatc 24480
tgttaaatga ggctgatgac gtttcctcct tttactggca atttaaacat gatggataat 24540
aaatgctaag cacttaacac agggcctaga agatattaac tgctcaataa atggtagctt 24600
cttaacagta ttcaaaccca tgtgctctta tcacatgcat tgttgtccct gtgtccagtt 24660
ggtggaatgg gaaaaggctc ccttgtaacc ccatctacca tctttatcag actttcctgc 24720
catggttcac agtaagagat agaagctgca cggtgacttc tggctcttta caatggtgag 24780
cggtgtgtgc ctggtaaggg agagctgatg tcactgcccc aaatccagta gtgagatctg 24840 
cgtgttctgg tttcctccag cagccttgct ttttccttta caatcctgca ggcagggaga 24900
caagggcttt ctacatggta ggctctggtt tggtcatcgt cacaactggg ggctgttcag 24960 
gtgggctccc attccagata cctaggctta tcaatccctt ttggcacccc cggccttttt 25020
ctccctcatg ccccattttt cagtttgaaa agcatggtta tcacaggaca agtagaagaa 25080 
gctccactgt ccactgaggc caatggatgg tgttctgcat gtgaacactc agtgaatagt 25140 
gagtgaatga gagtaacctg ggctccatcc tatttgcaga gagctttgga aaagattttt 25200 
ctccttaaag agccagaatg aagcctggta gtgggagagc tccagctcta gagtcacatg 25260 
agcctacatt taaattccag ccctgccact gactcccttt ttgaccttga gtgagttacc 25320 
taatctctct gtacctcact tttcttgtct gtagagtggg aataattcct gtctcagaga 25380 
aataaaagag tgcatatagt gtttgccaca tggagacaca tcaggtgtag gttaatactc 25440 
tgggccttgt ttccttattt gcaacacagc cctgccctgg agtggaagtg gcacctccca 25500 
ttggtcagct cttgaggctg tccccaggac aggcagaggg agggaatgaa tgggagccct 25560
agtgccagga cagaacagat ggcagctcag agctaggatg gctctctgga cctgtctctc 25620 
ctaccagagg tccccccgtc tggtgtggct cttcctggac ctggcatcct ctgctttttt 25680 
tttttttcca cctccaagca gaattactgt cctgtaggca gctcctctgc ttgaggacat 25740 
ctggggccag atatgttcac actctatcct gccttgccct tccctgagct caggatggac 25800 
gctcaattgg tcccagttat tgtctgcagc gcctgcctgc agcctcgatc cagcccagct 25860 
ccaccccttg cctgcaaggt ctgtttccta acagctgctc caaccacaca cctcggttct 25920 
gcgggagccc ctcctcttcc tccctccctc cctcattcag gggtgggact gaagaagaag 25980 
gctaacttga cagcagcgct tctttcttag ctagtcaccg gcccctgctc aagaatgcca 26040 
gtgtgtgtgt agcctccaca gagaggtcgt tttctcggag tccagagggg ccgcctgagc 26100 
ttctgagaac tagggaggag ccatcccagc catgagcccc tgtgggaatc tgctgggggc 26160 
caagtggcct ggagtcctca ggctcccgca gctgctccgg agggagaggt gagctcaggg 26220
cagcctgcct gcagccagag gtgccgggag ccccgggcct gtcatggtgg ccatctacag 26280
ccggcctgag gcagtcacag acggatttgc agctgagcct gtctatctgg tgtgggaaga 26340 
agatggggag ttacttgtca gtcccggctt acttcacctc cagagacctg tttcggtgag 26400 
ttggtctccg agttcccctc tccatctctc ctggcccctg gtcctgagag gagggtggtc 26460 
tccctaaatc tccttctcac ttagtccttt accatcggtt ctgccgggca gaagccagcg 26520 
gaggttatac ccaaggagaa tcggccttgt gaggtacccc cattatgtcc tggaagtggt 26580 
gaggggaggg atatacccag aaggaacttc ttagggagct ccagctcccc ttctatccca 26640 
gacaaacctg aaggagcctc caaaagatgc cactgacctg cccattgtag atgttactgc 26700 
ttccgggggg aatagcccaa atagagtgct gtttccagct ctcacatgtc ttacctgcgg 26760 
gccatgctgc ctgcccagga atttgtccca acaagcagga tgggcaggtt ttgccaaact 26820
gtggaaactg gcaagtcctg ggtgtgggta gcctggtaca cagtaggcac cttataaacg 26880
tttgttctct taatggcagg cacatttgcc tctggccttg aagggcttct gagctcccag 26940 
gtgaatgtag ttgctgggga aagacctggg cgagtgcttc taagactgga gcaatgggct 27000
ttagagtgtt cctgagctgc tgggccagcc cccacacctc ctcagtccct aggcctaagt 27060 
acctccacga gcctctctct gtggggcttc tcagagggag atgtggaaac tctacctcta 27120 
acctggcttt ctttgctcat tgccccactc cacctcccat agaaactccc cagggggttt 27180 
ctggccctct gggtcccttc tgaatggagc cattccaggc tagggtgggg tttgttttca 27240 
ttctttggga gcagcctgtt gttccaaaaa ggctgcctcc ccctcaccag tggtcctggt 27300 
cgacttttcc cttctggctt ctctaagcta ggtccagtgc ccagatcttg ctgccgggat 27360 
actagtcagg tggccaggcc ctgggcagaa aagcagtgta ccatgtggtt ttgtggaatg 27420 
accggaccct ggtagattgc tgggaagtgt ctggacaggg ggaaggggga agggaactgg 27480 
tcctcaatgc tgactctacc aagcgccctg ctagacactt tatcctttaa tctctcaaca 27540 
gcctaaagag attatatatc cccattttac agatgaggca accagtttca acagagttaa 27600 
catatggagc ctcactgggc agctttttct gtcttcctga ctttctctca tccttcaggg 27660 
ggctgcaggt ttgttttctt ctcctagtgg agaggaaatt ctcaggtttg ttttcctctc 27720 
ctagcagaga gtaaaaaaag ggatagtttg cctgacttgt tgaaggtgtg gctgagattg 27780 
ttttctaaag agccaatgga aattgatctt gagtttagga gaaagctttt acatgtggaa 27840 
ttaagatgcc aagtgttgaa gtagccacat ttcaggtcct cattaatttc tcttaatcct 27900 
gggcaggcag cttaggagaa gggttgttcc tttaggagcc aggaactata ccccttttac 27960
ccttggagag gcagggaagc cagggaggac acaacttctc aggaagagga gaagctagag 28020
cagatagtga actctcaacc tgaaccttta agggccagac cactaatgcc acccaagtcc 28080 
acctgccgtt tgtcttgttc tgtcccaggc tttctggaga acctgatctt cttgccccta 28140 
cccccaagct ccgtttgccc agctagagtc tggggggtac tgactgactt tcgtagacat 28200 
tcttcccttc cccaaataag aggccacatt cctgaagtca cttctgaaga gatagctgcc 28260 
acacagggct ctttcccccc agggagggac cacccagacc ctctgctctc ccaggtatcc 28320 
gttaccacat cactacctgg tcagaaagct gtttctgcca ttagcccctc cctcttttat 28380 
tataggatat cctcaagggc tcctctttgg gcctcagttt catccttggc agaaagtaga 28440 
agctagactt cttgggctcc tgaacagggt ccttgctgga ttctgtgaaa caaattaagt 28500 
tcttgaccct aggcctctgg gggagtacaa agtctatggg agttctgggg ctgtggttgc 28560 
aaggaaagtg acgcaaccag attccatggg gacatgatca ggcgtgacat gtgagggagg 28620 
aagagggagc aagggaatga agaatacaac ttctgtgtcc catacacccc tgcctgacag 28680
gccatacata ctcagcagag aatgcactgt ctttcctacc acactagcgt gaggagtgag 28740 
ctgcaattac cactgtgctt ccaagtaaga aaatacctca aattggaatt tacaaaagag 28800 
gtaaattagg gagtggcttt tgtcggacat ctttaaagca tttttctttt tatagaattt 28860 
cacttaatgt ccaatactga tttaatgagc ttgggtttac acattatctc ttgaagaaaa 28920 
caaatgaacc tttgtgttcc aaagcaatcc atgtttaaag ggaaaaaatt atgcataact 28980 
ctgcccagct tcacagtaac ctttggcagg tgccttaggt cctctgggac tcttttcctt 29040 
atctgaaaaa tgaaggactt ggatcaggtg aatggttccc agctctgcaa cttatgtggc 29100 
tcctcagagg cacacaagct cttttccatt atttgccaaa taatggaggc cctgtcttta 29160
actgcagtac cactacacaa aatacttgaa actacagtct tcctggtttt tggttggaac 29220
tgaatcagtg cactctagca acacttattt cttgctgttc gtaggcttca ttatgtgttt 29280
ggttaatttt ttaaaacaac aataacatat tccataataa ttacagctta attggcagac 29340
tgtttcagtc tataggatct gcaggaagga ggagtaataa cgggattttt gactgagctc 29400
ttatggaaca gagtctctct aggcccctgt catatctgcc cttctgggcc ctggggaaaa 29460 
gttggcatcc ccagttgtgg tgctctccag gtgccctcag gctgtggtgg agggagcttc 29520 
ccattctctc cttcagccca ctcaattcag aggctagggg ctgaaagaag cttctctaca 29580 
actggctgtt cactgggagg ttaagggatg accatccagc caggccttcc tcaggacatg 29640 
ggagggctta tgctttaaca tgtgtaaatc cactgcaata atgactggtt cttttacccc 29700 
ataaggttga gaatttacct gtaaacattt ttgtctgaag aatttggatg taagtgaggg 29760 
ctgggcctct atcttatctc acttggcttc tctcagcaca gcaccttgcc tgcttgttct 29820 
tacacatcct agatgcaccg taactatttc ctaattatta gaaatctatt agaatcaatt 29880
gatttcagct gggcttggtg gctccttcct gtaatcccag cactttggga ggctaaggct 29940 
ggaggatcac ctgagtccag gagtttaaga ccagcctggg caacataggg agaccctgtc 30000
tctacaaaaa ataaaaaatt agccaggcat ggtggtgtgc acctgtagtc ccagctactc 30060
aggaggctga ggcaggagga tctcttgagc ctgggaggtc agactacagt gagcaatgat 30120 
tgtgccactg cactccagcc tgggtgacag agtaagactc tgtctcttaa aaaaaaaaaa 30180 
aaaaaagttg atttctattt ggatagataa ataattcatt ttaggacctt tctttttcac 30240 
ttacagaaat ctgtttcatt ctgggctgag aagcaggtcc atattgctag gcataggaga 30300
aaaaggggtc tgtctgcatt tgcccttggt ggtctcaaat tggggaggga aagaaatgaa 30360
cacttactgg ctaccttctg tgagccaggc atcatgcaag acatctgtac ataatttaat 30420 
tctcataacc ccataagata ttattagcaa tgtacaagtg aggaaactga ggctcagagt 30480 
catgaagtaa ctggccttgg gtgacacaga tggtaaatgg cagagaagga atatggatcc 30540 
aggtcttgaa agagaaaatc tcaactgatt atctttttta aaaaactcat atgttctctg 30600 
ctgactcaaa aggtctctgt gtggatctgg gttgacccac tgaactgacc atcagggttc 30660 
catgcacttt gtatctgccc aagccctcag aacccctcag taatgttttg gaagatgagt 30720 
tttggaggtt gtccttaggc atagcctcag cgtatgtagg cctctaggtg atctccccta 30780 
acctgaggat ttcagctcaa ttcactctgg ctcctcagga cagtgggatg actggttcag 30840
acctcagctt taccacctcc cagctgggta ctcttctacc tacagccagg gcagattttg 30900 
actttcactt gaaacttcca aaaattgaaa ggtagaaaca cagccttggc tttgggaaga 30960
acgtatgatg tccatggcct ctaagcatct gaggtgggac atgttcgagt agcaccttac 31020
agttccaaag tgtgttctgg gttctttgtt taaaagaaca gagactgctg gggaattgaa 31080 
cactgtgaag tatatgaagg aggagaattg tgctatttaa cattcagtac ttgggctaaa 31140 
ggagaagcat cacgaagtgt taacactcaa agggtcttga gctgtcaggg ctccagcttc 31200 
cttattttca caggtgagaa tcctgaggct cagctgttga gatgtgctgt ctcactccgg 31260 
tgacatagta cagtggatgt ggctttgcag ccaagcacac atagcttcac attccagctc 31320 
catcaattat gtattgggca gctttgcaga atgatttgac tttaactctg cttttcagtc 31380 
ttctgtaaaa cagggataat cctgctaccg tagggttgtc aggattagag ataatataaa 31440 
taaggtacct catataggac ctggattatg gctggcattc aataaatagt agctgttaat 31500 
tgatagctaa gctagaactc tgaagtctac catggcaact tcttaagtgg tctgagaacc 31560
cagttgtgtt ctgtggcaaa acacagctta gggatccata cccagccctc ctgtcagctg 31620
ttcaccttcc agttcttcag agacatgtgt ggcagtgact ttggccacat agctggctgt 31680 
gccctttaaa ggcattcctt gacacagata tgtggactgg tgacgttgct ctccagccag 31740
gtgttcttcc cagcaggctg gcctggctgt ctcctgcatg cctgtacttg tttgtctccc 31800 
tgctccctct cctgggcctg gccagagcta cttgcagcaa acaaaagcag gatattggca 31860 
atggaaagga gggtgtgttc tggtgctccc atgccctgcg gcgcacatac cattgcaagg 31920 
gcgtaacaga gcccaggcct gcatttgggt gcaaataagt ctgcacacag aagaaaagaa 31980 
ggacctggtg accaggagcc atggaaccct tgtgctcccc tacctgggct actggttctt 32040 
gccactccta ccattttcag tttggaaata tttgttaagg ctttgctctt ccaggtcctt 32100 
tgcttggtgc tgagtctacc aagagtaagt gggatgctgt ttttgtcctc agggagctaa 32160 
cagtctagtg aagaagaaag atggttgccc aggaacttct aagtcagaag gcaggaggca 32220 
agaaggaagc ccctgctcct actgccagcc ctctgttggg caccccatag ttcttcagaa 32280 
ccacatttaa tcctcactgc aggccaggca tagtggctca cacctgtaat cgcagcactt 32340 
cgggaggcca aggcgggcag atcacttgag gtcgggagtt cgagaccagc ctcaccaaca 32400
tggggaaacc ccgtctctac taaaaataga aaaattagcc gggtgtggtg gcatgcgcca 32460 
gtaatcccag ctactcagga ggctgaggtg ggaaaatcac ttgaactcgg gaagcagagg 32520 
ttgcagtgag ccgagattgt gccactgcac tccagcctgg gcgataagag caaaattcca 32580 
tctcaaaaaa aaaaagaaaa aagaaaaaat cctcactgct accttgaaag taggtgatga 32640
cattgccatt tcacaaatga gaagtgaagg ggctagccca agatcactta ggtggtaaat 32700 
ggtggtgcta agattagaac ctcagatcat ctagggaaaa acacagatat gcacagagtt 32760 
aaggggaccc agggtattgt ttgtcctctt gtttcacagg tggggaaaca acccagagag 32820 
ggaaaggggc ttgtccaagg caatttagca cccaagaact tgaacccata tctctctcct 32880 
cctcatttag agctcatccc acatgtatct tatattgaga ggagtgtgag ccacatacca 32940 
agaacagtct tcccctctgc ctccaacctc actgtgcagt tttgagacac ttcacagcca 33000 
tactcttcat gccataccca gcccttaaga ccctgaagtt ccccttccat aagacaagta 33060
ggaaaagcta tagggtaaaa atagccatca gtgtttgttg agcacccagg aggaattggg 33120
cactccagaa agataaaggg attctcaggg acttgcttct ctagacttcc ctagctcagc 33180
tgcttcaact cattcctgcc cctcttctct acctcccgca gtgctcagaa gtagtagaac 33240 
tcactgtggc ctctcacctt gcattgttga gttttattta gactttctct tcctcaactc 33300 
ttcataagct catgaaaggt gaagtagggt gccctgtgta tttatctttt atatctgcag 33360
tgcttagcaa gttataataa tgcacttgcc tggcaaaagg ctttctctca tacattagct 33420
tatttcctct tcacattggc tctttgtagt aataggatgc tattagttat tttcaatgag 33480 
agaaagctac taagagaagt tgtccagcta gtgacagtaa gtggctgata aagtgagctg 33540 
ccattacatt gtcatcatct ttaatagaag ttaacacata ctgagtttct actatattgg 33600 
gtcttttttt tttttttttt ttttttttta gagacggaat cttgctctgt tgtccaggct 33660
ggaacgcagt ggtgcaattt tgggtcacca caacctccgc ttcccaggtt caagcgattc 33720
tcctgcctca gcctcctgag tagctgggac taccagtgca cgccaccacg cccggctaat 33780
ttttgtattt ttagtagaga cagggtttca ccatgttggc caggctggtc ttgaactcct 33840 
gaccttgtga tctgcccgcc tcagcctccc aaagtgctgg gattacaggt gtgagccacc 33900
gcgccctgcc tatattagga cttttatcta agctatctct agctagctag ctagctagct 33960
ataatgtttt ttgagacaga gtctgactct gtcacccagg ctggagtgca gtggcgtgat 34020
ctcgactcac tgcaacctcc acctcctggg ttccagtgat tctcctgcct cagcctcccg 34080
agtagctggg attataggtg catgccacca cgcccagcta attttttgta tttttagtag 34140
accaggtttc accatgttgg ccaggctggt ctcgaactcc tgacttcaag tgatccaccc 34200
gcctcggcct cccaaagtgc tgggattata agcataagcc actgtgccca gctgctctct 34260
atatttttaa tacatattat ttccattaat tttcacagca gttcatttta tagatgagga 34320
aactaggcca gagaagtaaa atatcttgcc caagatgatg taactagtaa gtggcaggat 34380
caagattcaa accaagcaat gttcaaacct cttggaagca agaatgtggc cactgtggaa 34440
ggtgcaaggc cttgacaaca agaataggga aaagacggaa ctagaaggaa agagatggca 34500
tgggctcagc aggccaggga gctcttagct gtgtgtgttg ggaagctcag aagggaggaa 34560
gaggttgtct gtgcaggtaa gtcctgagaa cacaccagac ttttgagagg tggagcttca 34620
tagccaggtc attaggggag aagggagcta tagatttttt tttttttttt tttttttttt 34680
ttttttttag agacggggtc ttactatgtt gcccaggctg gtcttgaact cctgggctca 34740
agtgatcctc ccacctcagc ctcccaaagt gctgggatta gaggcatcag ccaccccgcc 34800
cagcgagcta tggatctaac atgtacatct tacacagtgc taatagaatg ttgggtttct 34860 
tccccaatat tttattttga aaaaaaattc aaatatatag aaaagttgaa aaatgtagtt 34920 
caaagaacac ctacatacct ttcacataga ttcatgattt gttaatgtta tgccactttg 34980 
tatatatctc tctccctcct atctgtatac ttttatttat ttatttttgc tgaactattt 35040
ccgagtaact taaaggcatc ttgattttac ccttgaacag ttcaatatgt ttctgctaag 35100
aattctccta tataagtcag atatcattac atctaagaaa attcacggca attttacaat 35160 
ataatattat agtccaaatc catatttcct cagttgttcc aaaaaatgtt catggctgtt 35220 
tcctttttta atctaaattt gaatccaagt ttgaggcatt gtatttggtt gctgtgtctc 35280 
tagggttttt aaaatctgtg ccttttcttc tccccatgac tttttagaag agtcaagacc 35340 
ggttattctt atagaataac ccacattcta gatttgcctg attagttttt ttatacttaa 35400
cgtatttttg gcaagaacat tacattggta acgctgttgg tgatgggtca gttttgaaga 35460 
gtggagatga ttaaactgct tttgttcatt gaagtatctg tcaagaccag cgatccttaa 35520
ctggtgccat aaataggttt cagagaatcc tttatctata caccctgtcc cccacctaaa 35580
ttatatacac atcttcttta tatattcatt tttctagggg aggcttcttg gcttttatca 35640 
aattctcaga gggccccaag acccaaagag gttatgaaac actagtctgt ccactgaggc 35700 
aggcaacaca gagctggttt ctggggcctt gttcagtctg aaccagcttc ccttggggag 35760
atagcacaag gctgtaactt tgccccatct tggctttgga tcaaagagga ctgtccattt 35820 
tgttgtcata cctaggaacc agggacagct tatgtggcct ggttccaggg atccaggaga 35880 
atttcagttc ttgtcttgcc tttcaggtgt tcagaatgcc aggattccct caccaactgg 35940 
tactatgaga aggatgggaa gctctactgc cccaaggact actgggggaa gtttggggag 36000 
ttctgtcatg ggtgctccct gctgatgaca gggcctttta tggtgagtga atcccttcat 36060 
atctgcccct cttggtcttc agagtccatt gacagtgctt ccagttccct gtggcctgtt 36120 
aatcttttag tctttccatc agccagggca tctcccttta tttattcatt cattcaacta 36180
gcaggtatca attgagcacc tactaagtga aaggtaagat ccttccctca aagacttaat 36240 
agttgaacgt tgggagtggg aggagaggca ggcagagagg agacacaata tagttggata 36300
aggacctcca aggagagtgt tacaggctga gaggaggata tacttaggtt gtctttaggg 36360
aatcagaaaa ggagactctg gaataggctg gcagagagag gggctacctc ctatacctgc 36420 
tc£ggacaaa cgactttaag catagtgaca gatttgccaa ccctgtattg gaagaactga 36480
tcttttttag tggggatgat tacttctggg gatttcttct cataactgag accaaaacag 36540 
ttttgtgcag tctcagaaat gacaggaggt accaatctga cacttccttt ggaagctcta 36600 
gggcagagag tgaaagagtg gattttgacg ggggccttgc ttggaggtca ttcacccacc 36660 
cctgtcctca ctccagcaac agtgataact cacttccttc ctccctttgt acacccttct 36720 
ccccacctgc tcacaggtgg ctggggagtt caagtaccac ccagagtgct ttgcctgtat 36780 
gagctgcaag gtgatcattg aggatgggga tgcatatgca ctggtgcagc atgccaccct 36840 
ctactggtaa gatagtggtc ctttgtctat cctctcccat ataagagtgg ctggcgggga 36900 
gggacagtgg cagggtgagt tgggcagaag gagtgttagg gtagtcagag cattggattc 36960 
ttaccacagc agtgctctta accagctctt taacttgtaa gcagaatgat ttacacatgt 37020 
ctctaccctt tttccttacc aaccttgaaa atgtcttcac tctgccctgc aatcctccca 37080 
gtgggaggca ctcttcaagg acgatcccag aacattaaag tcaaagaccc cttagagctc 37140
accctgtcca accaccttgg ttgataaaag aagtcagcct ggggcccatg gaatagaata 37200 
gtacaagggc aaggttctca ttgtgagtca aaggtagagt gaagagaacc cagaccatct 37260 
caccccaacc caggccagtg tttttccaaa tataccactt gctgcagatc tagctcagca 37320 
cccccagtcc cagcccaccc tgagaaccca ggctcctcat tctgagcagc cagctagaat 37380 
catgacaaag agggtggtag tgagactatg ggtactgttg cttaaagcca catggtgcag 37440 
tggttgctgg ggggcttctg tgtgggactc tagcatctta ttcccccctg tgccctctcc 37500 
ccagtgggaa gtgccacaat gaggtggtgc tggcacccat gtttgagaga ctctccacag 37560 
agtctgttca ggagcagctg ccctactctg tcacgctcat ctccatgccg gccaccactg 37620 
aaggcaggcg gggcttctcc gtgtccgtgg agagtgcctg ctccaactac gccaccactg 37680 
tgcaagtgaa agagtaagta ttttgagaac ccttcagcag gggttcttga gcagagtctg 37740 
taaatgggcc tcagagggct tagacctcca aagtctcatg cagaactccc tttattctca 37800 
tctcatatct ttctcctgga ccccactatg ctgtaaccgt acctgggcct tggcacttac 37860 
tgttctctct gcccaggcta cttcctaccc gatacttaag gcaagaatca ctcacctttc 37920 
aggtgtcagg tttcaggtca tgtttgctct ttgaaatcat ctggcttgat tatgtgtatt 37980 
agttgtttat cttctatccc ctccactaga atgtaaattc cagaagaaac ttgctgtctt 38040 
attcagtgct gcatgcccag ggcttggaag agtacctggc atatagtagg agttgattga 38100
ttattatttt gtcagtcgag agaatgaatg gagaaaatgt ggtccatggc ccaaaagaag 38160 
ttaagaccct atcctagatt caggccagag accagatgga gaaagagtct gtgtctatct 38220 
aataccagta atgtcgtacc tctggccgct taccatgtaa atattgattg tgtatctacc 38280 
atgtgttgga cactaggcta gtgcttgcac agcaggtgaa agatactaga gtttgggaag 38340 
tcaggaggag ctaaggtctg ttctacaacc ttattagatg aagaggagag ggaattgtgt 38400 
tcagggcaga gggagaagca tttctccaaa agtaggagtc ttaatcatgt ctgatgtagg 38460 
ttgagtgtgg ccagaaaagg ggctgttaag tatagagggc ctggattatg aaaatccagc 38520 
agatccattg agagtttaag cagcaaggtg ttgtgaccaa gttaacattt tagaaggatc 38580 
actggtatgg aggttggatt ggagagggga aagcctaaag gtatagagac tagttaggaa 38640
gctattgtag gctgggcatg gtggttcatg cctgtaatct cagcactttg ggaggctgag 38700
gtgggaggat tgcttgaggc caggagttga agaccaacct ggccaacata gcaagacccc 38760
gtctctgttt ttcttaatta aaagaaaagt ccagacgtag acatagtggc tcacgcctgt 38820
aatgccagca ctttgggagg ccaaggtggg cagattgctt gaggtcaaga gtttgggatt 38880
aggccaggcg cagtggctca cgcctgtaat cccagcactt tgggaggccg aggtgggcgg 38940
atcacaaggt caggagatca agaccatcct ggctaacaca atgaaacccc gtctctacta 39000
caagtacaaa aattcgccgg gcatggtggc ggacgcctgt agtcccagct actcgggagg 39060
ctgaggcagg agaatggcgt gaacctagga ggcggagctt gctgtgagca gagatcacgc 39120
cactgcactc cagcctgagc gacagagcga gactccatct caaaaaaaaa aaagagtttg 39180
ggattagcct ggccaacatg gcaaaacccc atctctacaa aaagtacaaa aaaattagct 39240
gggtatggtg gtgcgcgcct gtaatcccag ttactcagga ggctgaggca tgagaattgc 39300
ttgagcctgg gaggtggagg ttgcagtgag cccagatcat gccactgcac tccagcctgg 39360
atgacagagt aagatgccat ctcaaataaa aattaaaaac aaagtttaaa aaaaaaatag 39420
aagctattac cgtgatccag gtaagagatg tgaataacta caatgatgga aagaaggcag 39480
agttcttaga gatgggagta ggagagatga gggaactcca gattgggaag atgatgttca 39540
agtttctggc ttaggccaca gggtgagtgg caattccctt cactgagatg gggcatcctg 39600
gaaaaggtgt tgcctttctg tgtgggtatc ctgggcccct taggggccac tggtggcctg 39660
ggacctggta aaccttccct gcacaagcag aattggtcaa gcaggttttt aggacatctt 39720
taccctgcct caactcttgt ctggcccagg gtcaaccgga tgcacatcag tcccaacaat 39780
cgaaacgcca tccaccctgg ggaccgcatc ctggagatca atgggacccc cgtccgcaca 39840
cttcgagtgg aggaggtaga gtgtgtgtct aatctgtctt gtgagggtgg gacatggaac 39900
agatcctctg ggaaatcagg ctgtagcctt taccttttcc tacccccagc ccatctcttt 39960
gtcttagcat tgagcctgtg accactggtg acctatttca gcgtaacagg ttcccagggt 40020
agcagggatg gttgatggac gggagagctg acaggatgcc aggcagaggg cactgtgagg 40080
ccactggcag ctaaaggcca ccattagaca agttgagcac tggccacact gtgcctgagt 40140
catctgggtt ggccatgggt ggcctgggat ggggcagcct gtgggagctt tatactgctc 40200
ttggccacag gtggaggatg caattagcca gacgagccag acacttcagc tgttgattga 40260
acatgacccc gtctcccaac gcctggacca gctgcggctg gaggcccggc tcgctcctca 40320
catgcagaat gccggacacc cccacgccct cagcaccctg gacaccaagg agaatctgga 40380
ggggacactg aggagacgtt ccctaaggtg ccacctccca ccctggctct gttctgtcct 40440
atgtctgtct ctcggatgaa gctgagctgg ctttcagaag cctgcagagt taggaaagga 40500
accagctggc cagggacaga ctatgaggat tgtgctgacc cagctgcccc tgtggggatc 40560
acagtttaca gccagagcct gtgcggaccc agctgtctgc caggtttcct tagaaacctg 40620
agagtcagtc tctgtccact gaactcctaa gctggacagg aggcagtgat gctaaaccct 40680
gaagggcaac atggcctatg gagaaagcat ggagctcaga gcctggagta cgggcacaga 40740
taggattgaa taaattgtgt agaaagactt tgaaaacaat aaagcaaaag atgaatgaac 40800
gtttttttta gacttgaggg accaacaacc cccaaacccc agattctgcc aggtccatgg 40860
ggaaggagaa gttgccttga g^ggaagccc caagtaggga gacttacaga aaagaagtca 40920
agagcactgg ctcccaggca gaaatactga taccctactg gggcttcagg ctgagctcct 40980
cccttcacaa atcacttcat ctctctgagc ctgtttctgc atctgtgaca taagatggta 41040
agataaaggt ggctgtctca ccaattatgt aaggattaaa tgtggaaaag gacataaagt 41100
tgtatagtgc tgccataggg acagtgttca gtaaacgtga cacattctta gtatcactaa 41160
gaatcaggtt cttggccagg caccgtggct catgcctgta atcccaacac tctgggaggc 41220
ctaggtcgga ggatggcttg aacacaggag tttgagacca gcctgagcaa catagtgaga 41280
cactgtctct acaaaaaaaa aataataata ataattgttt ttaattagat gggcagggca 41340 
ctgtggctca cacctgtaat cccagcactt tgggaggcca aggccggagg attgcttgag 41400 
gccaggagtt caggagcagc ctgggccaca ttcctgtctc tacaaagaat aaaaaagtta 41460 
actgggcatg gtggcacatg cctgtaatcc cagctactca agaggctgag gaggaggatt 41520 
gcctgagccc aggagttcaa gactgcagtg agccttgatc acaccactgt actacagctt 41580 
gggcaacaga gtgagacctt gtctccaaaa aaaaaagttt gttttttttt atccactctc 41640 
ctcaccaaac aaactgagta agttagagcc ctctcagctg gcatgtgttg gaaacagtgc 41700 
cctctcatta aagtgctgcc ctcactccca ttgcctcttg gccttggtca gtatgatgaa 41760 
attagtggga ggcagggcaa cagagggcag ggaagagcta gaaatccatg gcctggaaaa 41820 
gggaagattt gggagtggcc aggtatctgt agagccacca tgcagaggag gggggcagct 41880
agccttgtgt gctctggtgg gcatggtcag caggaggcag agcaaaagga caagggtaag 41940 
taaacctgta ggtcgggaca agccaagagc catccagcgt cagtcctctc tgggtagccc 42000 
aagtaaagca ggagcatacc ccagagagaa agttcgcagg gctgttcacc tgcagtgctg 42060 
tggacttcaa ccttcttgtt ccttcttcag taagtgaaaa taacagtcat tgaccatgac 42120
tattatcgac cgcttttgaa aatgtaaaca tagtgacttt attgctgtaa aaatcatacg 42180 
tgtttatcat cttaaaattc aggaaacctg gacaggtaca aagatgtgca aaatatcatc 42240
caaaatccca tttgctggcc aggcacggtg gctcacgcct gtaatcccag cacattggga 42300 
ggccgaggcg ggcaaatcac ttgaggtcag gagtttgaga ccagcctggc caacatggtg 42360 
aaaccctatc tctactaaaa atacaataat taggctgggc gcagtggctc acgcctataa 42420 
tcccagcact ttgggaggcc gaggtgggcg aatcacaagg tcaggagttt gagactagcc 42480 
tggccaatat ggtgaaaccc catctctact aaaaatacaa aaattagggc cgggtgtggt 42540 
ggctcacgcc tgtaatccca gcacttaggg aggccgagac agatggatcg cgagatcagg 42600 
agttcgagac caacctagcc aacatggtga aaccccatct ctactaaaaa aatacaaaaa 42660 
ttattcggtt gtggtggcac acgcctgtaa tcccagctac ttgggaggct gaggcaggag 42720 
aatctcttga acctgggagg cagaggttgc agtgagtgga gatcccgccg ttgcactcca 42780 
gcctgggcga cagagtgaga ctccatcaaa aaaaaaaaaa aaaaaaaaaa aaattagccg 42840 
ggcgtggtgg cgtgcaccta tactcccagc tacttgggag gctgaggcag gagaatcgct 42900 
tgaacctgga aggcggaggt cgcagtgagc cgagatcgtg ccattgcact tcagcctggg 42960 
cgacagagcg agactctgtc tcaaaaataa taataataac aataactagc cgggcctggt 43020 
ggcacatgcc tgtagtccca gttactcagg aggcggaggc atgagactca ggtgaactag 43080 
ggagacagag gttgcagtga gccaagatca caccactgca ctccagcctg gttgacagag 43140 
cgagactctg tctcaaaaaa aaaaaaatcc catttgctca ttttttggat actagtataa 43200 
ctatcactct aaaccagtta gtacttaaat caagcagata tgggagatgg tgaattacca 43260 
tctacagtgt tgtcatatat gtcacatact gagcattatc agctagtaga atctagttaa 43320 
ttgttctatg tgtgatgtat gcagagttcc cattttgaat gtgtttttac tatgcttaaa 43380 





taaatgactg ctgtcagcaa ccccaaaatg atacatctga tgtaagagcc cctgttcccc 43440
aataataaca tctaaactat agacattgga atgaacaggt gcccctaagt ttcctccctc 43500 

cagggtttct tggccggtct ctgaggacta cacatcccta ctcccgtctt tcctcatctt 43560

caggcgcagt aacagtatct ccaagtcccc tggccccagc tccccaaagg agcccctgct 43620 
gttcagccgt gacatcagcc gctcagaatc ccttcgttgt tccagcagct attcacagca 43680 
gatcttccgg ccctgtgacc taatccatgg ggaggtcctg gggaagggct tctttgggca 43740 
ggctatcaag gtgagcgcag gcaacaattg ctttgctctt ctgcccccag tccctctgtc 43800 
actgtctttc ggggatttct catcacttgg ccccacccca caccatgcag gatgccaggc 43860 
ctccttcctg gctttgggtg ttggtgtgag aggtatcctt cacccccacc caggccacct 43920 
aaggtcaatg ttgctgttac agtgagcttg tggacctgga gatccaggtt gggttgagct 43980 
gtgcctgtgg ccctcctgcc tccagtcagt gggtgtttgt taggtgcctg cagacctcag 44040 
taccgggcat gctacaagga gcacacaggg gaatggctcc tgcctccctg gtgaacagtc 44100 
tcagggacta acctctctct ttctctcctc ctcctcctct tctgctgaga actgggaggg 44160 
ggggtcaggt aagacgtgtg tctcagcttg ggggcagcag ggctggagag ctcacccccg 44220 
atccacccag ctccctggtg catgtctttg gcactgacct tcctgccccc agacttctgt 44280
tcactcagga gactcacttc tatgccaaat gaccagagcc cctgcttggc ttggcagcat 44340 
cccctcctgc cttcttcccc ccttcccttt tctgggttct tgcctgtcct ctgtgcatgc 44400
ccagctctcc aggaaagagg gtttgcttcc gtgtgagtcc catgttgctc cacgctgcat 44460 
cttccacaca tgaactctgt cattctgacc cggctcagtg tgccctccaa gggatgggat 44520
ggccagctgc atagattttc tcaaacagtt ctccagaact tcctctggtc tcagcaccat 44580
taacagtcac cctccctgta ggtgacacac aaagccacgg gcaaagtgat ggtcatgaaa 44640
gagttaattc gatgtgatga ggagacccag aaaacttttc tgactgaggt aagaagatgg 44700 
agggggcccg ggaggttggt gtcaccattg gaagagagaa gaccttacaa ataatggctt 44760 
caagagaaaa tacagtttgg aattactgtc ttaaagacta agcagaaaag agccctagag 44820 
gaatatccca ctccctctaa attacagcgt aattatttgt tcaatgaaca cttactaaaa 44880 
gcaacacaaa cagggtacaa gggatgcagt aacaaaagat acagggttca gaagagctct 44940 
caggttatga ggatgatgga catgaaaaca ctccaattta gtacaactca atgttataat 45000 
cctcacctga acgccctgct aagggagcct ggaggggagc tccctgagca ctcacactcc 45060 
ttgggcattt acagttttca ctacccctcc caagttactt catggagtaa cttaagttgg 45120 
ggacacctgt ggtctgggta ttgccctcca agccacttgg ccactcccac cccagttctc 45180 
ccaatgcagt tccaagggta aggcctatga agccatctcc atctatatgg tggtggtctt 45240
ccctcatcct gatcttagtg ccctgtcata tcacaagata ggaggtagga gatacaggtg 45300
gtaacacttg tcaagctgat tccttggagg gaagaggtaa ggaagacagt gagaagttaa 45360
ccaccagctt tccttggctt cccccacccc caggtgaaag tgatgcgcag cctggaccac 45420
cccaatgtgc tcaagttcat tggtgtgctg tacaaggata agaagctgaa cctgctgaca 45480
gagtacattg aggggggcac actgaaggac tttctgcgca gtatggtgag cacaccaccc 45540
catagtctcc aggagccttg gtgggttgtc agacacctat gctatcacta ccctaggagc 45600
ttaaagggca gaggggccct gctttgcctc caaaggacca tgctgggtgg gactgagcat 45660
acatagggag gcttcactgg gagaccacat tgacccatgg ggcctggacc acgagtggga 45720
cagggctcaa cagcctctga aaatcattcc ccattctgca ggatccgttc ccctggcagc 45780
agaaggtcag gtttgccaaa ggaatcgcct ccggaatggt gagtcccacc aacaaacctg 45840
ccagcagggc gagagtaggg agaggtgtga gaattgtggg cttcactgga aggtagagac 45900
cccttcctat gcaacttgtg tgggctgggt cagcagctat tcattgagtt tgtctgtgtc 45960 
actgaoactg accccagcca actgttctca gttcacagcc ctgttttcaa agaattacac 46020 
atctctaaag gcaaacaggg cacggacaag gcaaactgga gaggcaaact gtagcctgag 46080 
atggcctggg cttgccatca caggtattca ggtgctgagg gcccttagac caactagagc 46140 
acctcactgc ctaggaaatc aatgaagggg aaatgagttc tagcggagcc ctgaaggatc 46200 
agaattggat aaagttctta ttggcagaga ggcaccagga ttgaagtgac aggagcaaag 46260
acctgggagg aaagaggaga aaatcatcta tttcacctgg aaacaaatga ttccaagcat 46320 
agaaataata acagctgaca agtactgagt gccctctata tgctaggcac tgggctgagg 46380 
gattaacatg catgtgcatg tttattcctc atgacaacct tggtttccag ataagctgga 46440 
ctggaaaggg acagagctgg gatcctgggc taatcagtct ggtcgccaag cctgagactt 46500 
tagccactge ccttcacatg ggggtccatg aaaatagtag tagtctggaa cagtttgggg 46560 
gtacatcaag gtcgctgtgt tttaagctat ggagtctgga ctataggaga caaatgtaaa 46620 
agagtttttt ggttgactgg ctttttggtt tttttgtttg tttgtttgtt tgtttgtttg 46680 
tttgtttgtt ttttcctgtt tctggggctt gaatcaggaa ggaggttttt ttgttgttgt 46740 
tgttttgaga aaggatattg ctctgttgcc cagactggag tgcagtggca cgatcatggc 46800 
tcactacagc ttcgacctcc tgggctcaag caatcctcct gccttagcct cccaagtagc 46860
tggactacag gtgtgtacca ccacacctaa ttttttgaat ttttttttct tttttttttt 46920
tttttttttt ggtagagaca ggttctcact ttgttgccca ggcctgaatc tcaaactcct 46980 
gggctcaagc attcctcctg cctcgccctc ccaaagtgtt gggattacag ttgtgagcca 47040 
ccatgcccgg caggaaaaga tttttaagca agaaagctta agagctgtgg tttttccaaa 47100 
atgagtctgg gctggcacag tggctcatgc ctgtaatccc agcacttttt tgggaggccg 47160 
aggtgagtgg atcacttgag gtcaggagtt tgagaccagc ctggccaact ggtgaaaccc 47220 
ctgtttctac taaagaaaaa aatgcaaaaa ttagctgggc gtggtggtgc acgcctgtag 47280 
tcccagctac tcaggaggcc gaggcaggag aatagcttga acctgggagg cagaagttgc 47340 
agtgagccaa gatcacacca ctgcattcca gcctgggtga cagegtgaga cttcatctca 47400 
aaaaaaaaaa aaaagagaga ctgatatggt tagtacattg gggtggaatg cggagggtcc 47460 
agggaatgga gccctgcata gggggctaat gaaacatttc agatttctga attaaggtag 47520 
tggctgtggg gacaggagcc tgggaggcag ggtggagtca gaatggagag actggttggc 47580 
aatgagggaa caggaggagg aggaggagga gttacgagtg gcttgaggtg tcacttaeca 47640 
gacatttggg ggatggggga tagccgtgat tgttgagcaa ctggtttggg aagagctagc 47700
attgatccct gctgttctgt gctagcagaa cctatcagca tcttctgggc aggaaactgg 47760 
ctccatgaga ctggcttagg gagaggctgc tagtcaccta atctgcagag aaggggcagc 47820 
tggagctgtg ggacagaaga ggcatccatg tagctggtgg gggtgtctca gcttgtgaag 47880 
aggagatggc tttgagcagg gctgacactg aaaaggctgg aagaaaaaaa cagacacaca 47940 
agagtctcag gatcaggtag cataggaaag ttgtggacag tctttgagga gcactccctc 48000 
aggcaggcag gcaggcaggt catgagctat agcgattcag gaagagctcc ctgggtgtgt 48060 
gagcagctcc aggagcctaa gggatgaaag tagtattgca gggggctgga gagcaaggag 48120 
tggctccttc tacatttgca agggaaggag aaaggaagtt gctcctgaga gtggtaagag 48180
tcagtggtgg aggcctggag aggagacata acaaacaaat ttgttgacaa acattttggt 48240
aggaaggggg agagcttaaa gtttagacag tggggaaggt ggagtcttag aggaggtgaa 48300
tgtctgaaag acagagctag ctggagcaag aagtcecttc tctgttgcag gcaggaagga 48360 
tccaaagtgg ctcaagccag agattgggag agtggggagg egggagcagc ctggatctaa 48420 
gtaaaatggg tagaggtgga gggggtgctg caacggccag ggttttctga agttggggac 48480 
attaggagag agctgtgagg gctttggcca gccactgtgc tagtgattgg tgaaccaaag 48540 
gatgggcagg agatggcagc agggaagcag aggaagtcca ggcttcctgt tggtattggg 48600 
acaagggaga ggccatagga ggccctggcc ctgttgtcca ggttgggttc tgaagctggg 48660 
tgggcatggc ctggtaggag agcatctatg gcgcccaatt ccagattcag ggtctagttg 48720 
atttgctggc cctgtagcct cagctcatgc ttctgttcca ggcctatttg cactctatgt 48780 
gcatcatcca ccgggatctg aactcgcaca actgcctcat caagttggta tgtcccactg 48840 
ctctgggcct ggcctccagg gtcctatcct tcctggcttc cttgtcacaa aggaggctga 48900 
cttgtcccct ctggctagag ggcagaggtg ttgcctagga gctcctatct ttcccttcct 48960 
gcttcttcca atgcccttct ctgtcctctg ggagctccga gacacacaca gacataattt 49020
caccttctct cattagcaac ctttgaaata atttgattag aagggacttc agaagtttgt 49080 
tgactatatg tagaaaaccc tgtcatttta cctgcttttg ccccatagta gtcttgtaaa 49140 
acagttcatt gctgacccca ttttacagtg gtggcacctg aagcctcagc ctgaggccac 49200 
cgagctagta aatttacagg gaccagtttg agaccagcat tcctcccact gcccctcagc 49260 
tgtggtggtt acaatgttgt ttgtcttact gacttgctat ctggcttcct gggtgtctac 49320 
cggctggccc tggctctgcc ctctagaccc acaccacgca atcttcattc ctttcccaca 49380 
tgaetgccct gtagctattc aaagagcttg tctcccccaa gtctccccat ctactgcctc 49440
caccttgcct ttttctgtct tatcctggtt ctagccactg cctgaaatca ttttaggaat 49500 
aagacaggac agggaaaaac aaaagcaacc ccctgtccca cctctgagtt ccactctcca 49560 
agtccctgag cctcacctcc agggctccag tggctctgcc atgaacccac tgtgggctgg 49620 
gagtctgctg tgcacagata ccagaccctc agaaacacaa atgccaagtg tgtctgtttt 49680 
tttgttttgt tttgttttgt tttttagatg gagtctcatt ctgtttccca ggctggagtg 49740
cagtggtgca atcttggctt actgcagcct ctacctcccg ggttctagtg attgttctgc 49800 
ttcagcctcc cagtagctag gactacaggc gtgtgccacc acgcccagct aatttttttt 49860 
tttttttttt tgtattttta gtagagacag ggttttgcca tgttggccag gctggtcttg 49920
aactcctgac ctcaggtgat tcacccgcct tggcctccca aagttctggg attacaggtg 49980
gaagccaccg tgcctggcct gagtgtgtct atttgataga gctttctgct ctgattctcc 50040 
cttgctatac accttttctc cccttctcag tggcttctct tgcctatgct tcctccccag 50100 
ggccaggttt gagaacatcc ccatgaagtc ctgacctgtc ttttatccta ccaggacaag 50160 
actgtggtgg tggcagactt tgggctgtca cggctcatag tggaagagag gaaaagggcc 50220 
cccatggaga aggccaccac caagaaacgc accttgcgca agaacgaccg caagaagcgc 50280
tacacggtgg tgggaaaccc ctactggatg gcccctgaga tgctgaacgg tgagtcctga 50340 
agccctggag gggacacccg cagagggagg acagatgctg cccttgcatc agagccctgg 50400 
gaattccagg ggaggcctgt gaagcgtagg accggatacc cagagctgag gatatttttc 50460 
ccttgccagg tggggcctca cgatttagct cctgagctca gggggctggg aaetgatcag 50520 
tgtcccatca tgggggataa ggtgagttct gactgtggca tttgtgcctc agggatcgct 50580
aagagctcag gctattgtcc cagctttagc cttctctctc catggtgaga actgaagtgt 50640 
ggtgccctct ggtggataat gctcaaacca accagagatg ctggttggga ttcttgaaat 50700 
cagggttgtg aggcctcaga aatggtctga atacaatcca ttttggagtc tgaggcccag 50760 
agaagttcag tgaattgcct aggagcatac agctgcctaa tggcagaggc tagatgaacc 50820
ctagtctggt tcttttccac tttaacgtgc agtttcatcc taggcagtgt tatgttataa 50880
gggctctcca aggcagttca cctacggctg aggaaggact attttcaggt ggtgtctgcg 50940 
caggacagcc tgtggggtgt ccctacagaa cctgttctag ccctagttct tagctgtggc 51000 
ttagattgac cctagaccca gtgcagagca ggtaagggat gtaaacttaa cagtgtgctc 51060 
tcctgtgttc cccaaggaaa gagctatgat gagacggtgg atatcttctc ctttgggatc 51120 
gttctctgtg aggtgagctc tggcaccaag gccatgcccg aggcagcagg cctagcagct 51180 
ctgccttccc tcggaactgg ggcatctcct cctagggatg actagcttga ctaaaatcaa 51240 
catgggtgta gggttttatg gtttataacg catctgcaca tctttgccac gttcgtgttt 51300
cattggtctt aagagaagga ctggcagggt ttttttgttt tagatggagc ctcacttcgt 51360 
tgcccaggct ggagtgcagt ggcacaatct gggctcactg caacctctgc cttctgggtt 51420 
caagtgattc tcctgcctca gcctcccaag tagctgggac taccggcaca caccaccatg 51480
cccggctaat ttttgtattt ttagtagaga cagggtttca ccatgttggc caggctggtc 51540 
ttgaactccg gacctcaggt gatccgcctg cctcagcctc taaaagtgct ggaattaata 51600 
ggcgtgagct acctcgcccg gccaggtttt tttttttttt tttttagttg aggaaactga 51660 
ggettggaag agggcagtgg cttgcacatg gtcgataagg ggcagatgag actcagaatt 51720 
ccagaaggaa gggcaagaga ctgttcatgt ggctgtctag ctagctcttg ggccaaatgt 51780 
agcccttctc agttcccttc aagtagaagt agccactcta ggaagtgtca gccctgtgcc 51840
aggtaccacg tggacagagt gaggaatctt ggaaagattc ctacctttag gagtttagtc 51900 
aggtgacagc atatctcagc gactcaaaca cacacacatt caaagccttc tgtaattcct 51960 
acaaagttgt gaggggtaga ggagaggaga gacaagggat ggttaggata atgaaggaat 52020 
gttttgtttt tgtttttgtt tttgagatgg agtttcactc tgtcacccag gctggagtgc 52080 
agaggtgcaa tcttggctca ctgcagcctc cgcctcccag gttcaagcaa tcctcctgcc 52140 
tcagcctccc aagtagctgg gactacaggt gtgcgccacc acgcctggct aatttttgta 52200 
ttttcagtag agacagggtt tcgccatatt ggccaggctg gtctcaaatg cctgacctca 52260 
ggtgatacac ccgcttcagc ctcccaaagt gctgagatta caggcatgag ctaccgtgcc 52320
tggccatgaa ggaagatttg ttttaaaaaa ttgttttctt taatattaat tgaacacctc 52380 
tgttcagagc actgggctgg tgccagaggg tttcagacat gaatcagatc cagcacctca 52440
tagagcctta atctggcaca cacacacagc cacaaggaga cacagacaag gcagggtagg 52500 
atgagtggaa gctaggagca gatgctgatt tggaacactt ggcttctgca gtgaagcccc 52560 
ttcttagtcc tcttcagtaa cccagctctc agtggataca ggtctggatt agtaagattt 52620 
ggagagatga ttggggattg gggagagctc tctaacctat tttaccacct cctcttctgc 52680 
cattcttcct gtccacatcc ccagcatccc tttcccttgc caagtatctg tggcctctgt 52740 
agtcctttgt aaacagctgt cttcttaccc tacagatcat tgggcaggtg tatgcagatc 52800 
ctgactgcct tccccgaaca ctggactttg gcctcaacgt gaagcttttc tgggagaagt 52860 
ttgttcccac agattgtccc ccggccttct tcccgctggc cgccatctgc tgcagactgg 52920
agcctgagag caggttggta tcctgccttt ttctcccagc tcacagggtc ctgggacgtt 52980
tgcctctgtc taaggccacc cctgagccct ctgcaagcac aggggtgaga gaagccttga 53040
ggtcaagaat gtggctgtca acccctgagc catctgacaa cacatatgta caggttggag 53100 
aagagagagg taaagacata gcagcaagta atctggateg gacacagaaa cacagccatt 53160 
aaaagaaagt ttaaaagaag gaaattcacc caaaccattt gaatacagta agtgtattca 53220 
tctttcgata ttcccctgtc catatctaca catatacttt tttttatagt aaatagttct 53280 
gtattttgcc ctgcatttcc cttgtgttta ctatccagtc ttcctgttta tcatttttgt 53340 
cgacaacatg aaattctatt gagagactgt ctgaacatat tgtaatgtag atgttcaggt 53400 
ttttccagtt tctctttaca ataggtattt aactacagtg agcagtttta tgcatttagc 53460 
taatttctcc tttgaggaag tattttcaaa attaccttta ttcttctcag gtaataattt 53520 
cattattacc aaagttaccc taggtctttt caagtgtgtg gttaaaaaac gagaatctgg 53580 
ctgggcgcga tggctcacac ctgtaatccc agcactttgg gaggctgagg ctggtggatc 53640 
acctgaggtc tggagttcga gaccagcctg gccaacatgg tgaaacccca tctctactaa 53700 
aaatacaaaa cttagccagg catggtggca ggtgcctgta accccagcta cttgggaggc 53760 
tgaggcagga gaattgcttg aacccagggg cggaggttgc agtgagccga tatcacgcca 53820 
ttgcactcca gcctcggcaa caagagtgaa actctgtctc aaaaatgggg ttcttttcct 53880 
gccatcaaaa atcatgtttc ttttaaaaac aagttcaaac attaccaaag tttatagcac 53940 
aggaaatacg tcttctgtaa tctcccttaa ccaatatatc cctcaacatt ctcctcaccc 54000 
ccaactccac cctcccagga taaccagttg ggacataatc tttatttaaa aatggtttcc 54060 
ggatagagaa agcgcttcgg cggcggcagc cccggcggcg gccgcagggg acaaagggcg 54120 
ggcggatcgg cggggagggg gcggggcgcg accaggccag gcccgggggc tccgcatgct 54180 
gcagctgcct ctcgggcgcc cccgccgccg ccctcgccgc ggagccggcg agctaacctg 54240 
agccagccgg cgggcgtcac ggaggcggcg gcacaaggag gggccccacg cgcgcacgtg 54300 
gccccggagg ccgccgtggc ggacagcggc accgcggggg gcgcggcgtt ggcggccccg 54360 
gccccggccc ccaggccagg cagtggcggc caaggaccac gcatctactt tcagagcccc 54420 
ccccggggcc gcaggagagg gcccgggctg ggcggatgat gagggcccag tgaggcgcca 54480 
agggaaggtc accatcaagt atgaccccaa ggagctacgg aagcacctca acctagagga 54540 
gtggatcctg gagcagctca cgcgcctcta cgactgccag gaagaggaga tctcagaact 54600 
agagattgac gtggatgagc tcctggacat ggagagtgac gatgcctggg cttccagggt 54660
caaggagctg ctggttgact gttacaaacc cacagaggcc ttcatctctg gcctgctgga 54720
caagatccgg gccatgcaga agctgagcac accccagaag aagtgagggt ccccgaccca 54780 
ggcgaacggt ggctcccata ggacaatcgc taccccccga cctcgtagca acagcaatac 54840 
cgggggaccc tgcggccagg cctggttcca tgagcagggc tcctcgtgcc cctggcccag 54900 
gggtctcttc ccctgccccc tcagttttcc acttttggat ttttttattg ttattaaact 54960 
gatgggactt tgtgttttta tattgactct gcggcacggg ccctttaata aagcgaggta 55020 
gggtacgcct ttggtgcagc tcaaaaaaaa aaaaaaaaat gatttccagc ggtccacatt 55080 
agagttgaaa ttttctggtg ggagaatcta taccttgttc ctttataggc caaggaccgc 55140 
agtccttcag taacaccagt gtaaaagctt gaggagaaat tgtgaagcta cacagtattt 55200 
gttttctaat acctcttgtc attctaaata tctttaattt attaaaaaat atatatatac 55260 
agtattgaat gcctactgtg tgctaggtac agttctaaac acttgggtta cagcagcgaa 55320
caaaataaag gtgcttaccc tcatagaaca tagattctag catggtatct actgtatcat 55380
acagtagata caataagtaa actatattga atattagaat gtggcagatg ctatggaaaa 55440
agagtcaaga caagtaaaga cgattgttca gggtaccagt tgcaatttta aatatggtcg 55500 
tcagagcagg cctcactgag gtgacatgac atttaagcat aaacatggag gaggaggagt 55560 
aagcctgagc tgtcttaggc ttccggggca gccaagccat ttccgtggca ctaggagcct 55620 
ggtgtttccg attccacctt tgataactgc attttctcta agatatggga gggaagtttt 55680 
tctcctattg tttttaagta ttaactccag ctagtccagc cttgttatag tgttacctaa 55740 
tctttatagc aaatatatga ggtaccggta acattatgcc catttctcac agaggcacta 55800 
ctaggtgaag gagtttgcct gacgttatac aaccaggaag tagctgagcc tagatccctt 55860 
ccacccaccc catggccctg ctcatgttcc acctgcctct aatttacctc ttttccttct 55920 
agaccagcat tctcgaaatt ggaggactcc tttgaggccc tctccctgta cctgggggag 55980 
ctgggcatcc cgctgcctgc agagctggag gagttggacc acactgtgag catgcagtac 56040 
ggcctgaccc gggactcacc tccctagccc tggcccagcc ccctgcaggg gggtgttcta 56100
cagccagcat tgcccctctg tgccccattc ctgctgtgag cagggccgtc cgggcttcct 56160 
gtggattggc ggaatgttta gaagcagaac aagccattcc tattacctcc ccaggaggca 56220 
agtgggcgca gcaccaggga aatgtatctc cacaggttct ggggcctagt tactgtctgt 56280 
aaatccaata cttgcctgaa agctgtgaag aagaaaaaaa cccctggcct ttgggccagg 56340 
aggaatdtgt tactcgaatc cacccaggaa ctccctggca gtggattgtg ggaggctctt 56400
gcttacacta atcagcgtga cctggaectg ctgggcagga tcccagggtg aacctgcctg 56460 
tgaactctga agtcactagt ccagctgggt gcaggaggac ttcaagtgtg tggacgaaag 56520 
aaagactgat ggctcaaagg gtgtgaaaaa gtcagtgatg ctcccccttt ctactccaga 56580 
tcctgtcctt cctggagcaa ggttgaggga gtaggttttg aagagtccct taatatgtgg 56640 
tggaacaggc caggagttag agaaagggct ggcttctgtt tacctgctca ctggctctag 56700 
ccagcccagg gaccacatca atgtgagagg aagcctccac ctcatgtttt caaacttaat 56760 
actggagact ggctgagaac ttacggacaa catcctttct gtctgaaaca aacagtcaca 56820 
agcacaggaa gaggctgggg gactagaaag aggccctgcc ctctagaaag ctcagatctt 56880
ggcttctgtt actcatactc gggtgggctc cttagtcaga tgcctaaaac attttgccta 56940 
aagctcgatg ggttctggag gacagtgtgg cttgtcacag gcctagagtc tgagggaggg 57000 
gagtgggagt ctcagcaatc tcttggtctt ggcttcatgg caaccactgc tcacccttca 57060 
acatgcctgg tttaggcagc agcttgggct gggaagaggt ggtggcagag tctcaaagct 57120 
gagatgctga gagagatagc tccctgagct gggccatctg acttctacct cccatgtttg 57180 
ctctcccaac tcattagctc ctgggcagca tcctcctgag ccacatgtgc aggtactgga 57240 
aaacctccat cttggctccc agagctctag gaactcttca tcacaactag atttgcctct 57300 
tctaagtgtc tatgagcttg caccatattt aataaattgg gaatgggttt ggggtattaa 57360 
tgcaatgtgt ggtggttgta ttggagcagg gggaattgat aaaggagagt ggttgctgtt 57420 
aatattatct tatctattgg gtggtatgtg aaatattgta catagacctg atgagttgtg 57480 
ggaccagatg tcatctctgg tcogagttta cttgctatat agactgtact tatgtgtgaa 57540
gtttgcaagc ttgctttagg gctgagccct ggactcccag cagcagcaca gttcagcatt 57600 
gtgtggctgg ttgtttcctg gctgtcccca gcaagtgtag gagtggtggg cctgaactgg 57660
gccattgatc agactaaata aattaagcag ttaacataac tggcaatatg gagagtgaaa 57720 
acatgattgg ctcagggaca taaatgtaga gggtctgcta gccaccttct ggcctagccc 57780
acacaaactc cccatagcag agagttttca tgcacccaag tctaaaaccc tcaagcagac 57840 
acccatctgc tctagagaat atgtacatcc cacctgaggc agccccttcc ttgcagcagg 57900 
tgtgactgac tatgaccttt tcctggcctg gctctcacat gccagctgag tcattcctta 57960 
ggagccctac cctttcatcc tctctatatg aatacttcca tagcctgggt atcctggctt 58020 
gctttcctca gtgctgggtg ccacctttgc aatgggaaga aatgaatgca agtcacccca 58080 
ccccttgtgt ttccttacaa gtgcttgaga ggagaagacc agtttcttct tgcttctgca 58140 
tgtgggggat gtcgtagaag egtgaccatt gggaaggaca atgctatctg gttagtgggg 58200 
ccttgggcac aatataaatc tgtaaaccca aaggtgtttt ctcccaggca ctctcaaagc 58260 
ttgaagaatc caacttaagg acagaatatg gttcccgaaa aaaactgatg atctggagta 58320 
cgcattgctg gcagaaccac egagcaatgg ctgggcatgg gcagaggtca tctgggtgtt 58380 
cctgaggctg ataacctgtg gctgaaatcc cttgctaaaa gtccaggaga cactcctgtt 58440 
^gtatctttt cttctggagt catagtagtc accttgcagg gaacttcctc agcccagggc 58500 
tgctgcaggc agcccagtga cccttcctcc tctgcagtta ttcccccttt ggctgctgca 58560 
gcaccacccc cgtcacccac cacccaaccc ctgccgcact ccagccttta acaagggctg 58620 
tctagatatt cattttaact acctccacct tggaaacaat tgctgaaggg gagaggattt 58680
gcaatgacca accaccttgt tgggacgcct gcacacctgt ctttcctgct tcaacctgaa 58740 
agattcctga tgatgataat ctggacacag aagccgggca cggtggctct agcctgtaat 58800 
ctcagcactt tgggaggcct cagcaggtgg atcacctgag atcaagagtt tgagaacagc 58860 
ctgaccaaca tggtgaaacc ccgtctctac taaaaataca aaaattagcc aggtgtggtg 58920 
gcacatacct gtaatcccag ctactctgga ggctgaggca ggagaatcgc ttgaacccac 58980 
aaggcagagg ttgcagtgag gcgagatcat gccattgcac tccagcctgt gcaacaagag 59040 
ccaaactcca tctcaaaaaa aaaaa 59065 

<210> SEQ ID NO 4 
<211> LENGTH: 265
<212> TYPE: PRT
<213> ORGANISM: Human

<400> SEQUENCE: 4

Leu Thr Glu Val Lys Val Met Arg Ser Leu Asp His Pro Asn Val Leu
1               5                   10                  15

Lys Phe Ile Gly Val Leu Tyr Lys Asp Lys Lys Leu Asn Leu Leu Thr
            20                  25                  30

Glu Tyr Ile Glu Gly Gly Thr Leu Lys Asp Phe Leu Arg Ser Met Asp
        35                  40                  45

Pro Phe Pro Trp Gln Gln Lys Val Arg Phe Ala Lys Gly Ile Ala Ser
    50                  55                  60

Gly Met Ala Tyr Leu His Ser Met Cys Ile Ile His Arg Asp Leu Asn
65                  70                  75                  80

Ser His Asn Cys Leu Ile Lys Leu Asp Lys Thr Val Val Val Ala Asp
                85                  90                  95

Phe Gly Leu Ser Arg Leu Ile Val Glu Glu Arg Lys Arg Ala Pro Met
            100                 105                 110

Glu Lys Ala Thr Thr Lys Lys Arg Thr Leu Arg Lys Asn Asp Arg Lys
        115                 120                 125

Lys Arg Tyr Thr Val Val Gly Asn Pro Tyr Trp Met Ala Pro Glu Met
    130                 135                 140

Leu Asn Gly Lys Ser Tyr Asp Glu Thr Val Asp Ile Phe Ser Phe Gly
145                 150                 155                 160

Ile Val Leu Cys Glu Ile Ile Gly Gln Val Tyr Ala Asp Pro Asp Cys
                165                 170                 175

Leu Pro Arg Thr Leu Asp Phe Gly Leu Asn Val Lys Leu Phe Trp Glu
            180                 185                 190

Lys Phe Val Pro Thr Asp Cys Pro Pro Ala Phe Phe Pro Leu Ala Ala
        195                 200                 205

Ile Cys Cys Arg Leu Glu Pro Glu Ser Arg Pro Ala Phe Ser Lys Leu
    210                 215                 220

Glu Asp Ser Phe Glu Ala Leu Ser Leu Tyr Leu Gly Glu Leu Gly Ile
225                 230                 235                 240

Pro Leu Pro Ala Glu Leu Glu Glu Leu Asp His Thr Val Ser Met Gln
                245                 250                 255

Tyr Gly Leu Thr Arg Asp Ser Pro Pro
            260                 265



That which is claimed is:

1. An isolated nucleic acid molecule consisting of a nucleotide sequence selected from the group consisting of:

(a) [illegible];

(b) a nucleic acid molecule consisting of the nucleic acid sequence of SEQ ID NO:1;

(c) a nucleic acid molecule consisting of the nucleic acid sequence of SEQ ID NO:3; and

(d) a nucleotide sequence that is completely complementary to a nucleotide sequence of (a) [remainder illegible].

2. A nucleic acid vector comprising a nucleic acid molecule of claim 1.

3. A host cell containing the vector of claim 2.

4. A process for producing a polypeptide comprising culturing the host cell of claim 3 under conditions sufficient for the production of said polypeptide, and recovering the peptide from the host cell culture.

5. An isolated polynucleotide consisting of a nucleotide sequence set forth in SEQ ID NO:1.

6. An isolated polynucleotide consisting of a nucleotide sequence set forth in SEQ ID NO:3.

7. [Illegible] selected from the group consisting of a plasmid, virus, and bacteriophage.

8. A vector according to claim 2, wherein said isolated [illegible] and correct reading frame such that the protein of SEQ ID NO:2 may be expressed by a cell transformed with said vector.

9. A vector according to claim 8, wherein said isolated nucleic acid molecule is operatively linked to a promoter sequence.

* * * * *



